WHERE DO YOU WANT TO GO TODAY

Good question, Microsoft.

But what about tomorrow?

Ask Windows users where they want to go today, and their answer is likely to be this: Windows 95. It is, after all, a major advance in the state of Windows computing. And it does, finally, bring some of the innovations pioneered by Apple in 1984 to the PC desktop of 1995.

That's great, today. But where, one has to ask, is desktop computing going tomorrow? And is moving to Windows 95 really the right way to get there?



While other PC manufacturers are still struggling to get CDs to load, Macintosh users can create their own multimedia, work in 3-D, surf the Internet and see what's real about virtual reality today.


The future of computing.

In a word, it's multimedia. Microsoft and Intel say it's the future. So do we. The difference is, we deliver that future today. To see what we mean, simply turn a Power Mac on. When you do, you can not only get down to work (or play) with the CD-ROM of your choice, you can also start using 3-D graphics. You can talk to your Mac. And have it recognize your command. You can videoconference across continents. You can even dive into virtual reality. All at the touch of a few keys and the click of a mouse.

The power to do it.

To do all this, you need power. And the best way to get it is with a Power Mac. In recent tests, for example, the RISC-based Power Macintosh 9500 outperformed a 120 MHz Pentium-processor-based PC by 63% on average. When running scientific and technical apps, the performance advantage jumped to 80%. And for graphics, the Power Mac was more than twice as fast.

The easiest way to get there.

Of course, all the raw power in the world is worthless if you can’t use it. That's why every new Mac includes an innovative help system that doesn’t just answer your questions, but shows you what to do, where to click and what to type to get things done. And why we make it so easy to create Internet connections, install new software and set up entire new networks from scratch.

More choices than ever.

Today, every new Macintosh can read and write DOS and Windows disks. But our compatibility goes further than that. The Power Macintosh 6100/66 DOS Compatible, for example, runs thousands of DOS and Windows applications, in addition to thousands of programs for Macintosh. And our new Power Mac systems accept standard PCI cards.

In the future, Apple innovations will further break down the barriers between cross-platform collaboration. Distinctions between the platforms themselves will diminish. Even the boundaries between applications will blur.

All of which will add up, once again, to the most important kind of power of all. The power to be your best.

To learn more about Macintosh power today, and tomorrow, visit us on the Internet today at http://www.apple.com.
Why Settle For GIF, TIFF and JPEG
When You Can Have All This...

AVI      Audio Video Interleave
BMP      Microsoft Bitmap
CLP      Windows Clipboard
DIB      Windows Device Independent Bitmap
GIF      Graphics Interchange Format
EPS      Encapsulated PostScript™
FLC      Autodesk™ FLIC
IFF      Amiga Interchange File Format
ILBM     Amiga Interchange File Format
JPEG     True Color Compressed
PCX      Zsoft
PNG      The New GIF Replacement Format
RAS      Sun Raster
RGB      Silicon Graphics
TIFF     Tagged Image File Format
TARGA    Truevision

Add extensive multimedia support to your Windows application with just a few hours of development. The (£pmfnan Format Image Libraries span the entire Windows product line including Windows 3.1, Windows 95 and all Windows NT versions! The libraries use a consistent interface across all platforms which eases cross-platform development. By simplifying coding efforts these libraries substantially reduce your development time. The libraries were developed for intensive use in commercial applications, contain extensive error checking and have been field proven over several years.


Pricing for the Libraries:

Windows 3.1    $750.00
Windows 95     $1000.00
Intel NT       $1000.00
PowerPC NT     $2000.00
MIPS NT        $2000.00
ALPHA NT       $3500.00

Download ... and do... sample library, from our Web ... at ...Wncs.htm


Try these royalty-free Libraries Risk-Free 
with our 30 Day Money-Back Guarantee! 


Quality problems with AccuSoft™ libraries? 
Compatibility problems with LEAD™ libraries? 
Take $300.00 OFF with our Risk-Free Competitive Upgrade!


Conversion Artist
North Coast Software, Inc.

P.O. Box 459 • Barrington, NH 03825
Tel. (603) 664-7871 • Fax (603) 664-7872 • Email: 4386449@mcimail.com

All Trademarks are Copyright by their respective owners.
a single disk and online. To order the disk, send $14.95 (California residents add sales tax) to Dr. Dobb's Journal, 411 Borel Ave., San Mateo, CA 94402, call 415-655-4100
EDITORIAL

Rebel Alliance


Believing they needed a competitive edge to combat a common foe, a small band of large companies, including IBM and Apple, formed an alliance more than four years ago. The mission of the alliance was to create a new computer capable of running each vendor’s software, while providing advantages over that of the “Wintel” (Windows running on Intel-based platforms) empire. The alliance decided that, if it could base its approach on a RISC architecture that offered price and performance advantages over the empire, it might have a fighting chance. And while Intel was protecting its chip's architecture through patents, the alliance agreed to create a common instruction set architecture (ISA), allowing other chip makers to create processors that will run the same code.

Four years later, both Apple and IBM have launched hardware platforms based on the Motorola/IBM PowerPC chip. Now, several operating systems are either in development or already on the market, including AIX, Solaris, Linux, OS/2, Windows NT, and Apple’s Copland. Additionally, developer tools are beginning to show up in numbers. The big question now is whether corporate developers will embrace the new technology.

Developers must consider a dizzying array of options. First, there was the original PowerPC 601 microprocessor, manufactured by IBM but sold by IBM and Motorola. The 601 was designed to bridge the gap between PowerPC and the POWER chip used in IBM’s RS/6000 workstations. Thus, it uses the older POWER instruction set, which has since been eliminated from the specification. The MPC603 and MPC603e included power-management functionality for notebook computers. The 64-bit MPC620 chip is still in development. However, features and options such as instruction and data caches all vary with the different implementations. For instance, the MPC602 (which is targeted at consumer electronics and embedded applications) has dual 4-KB instruction and data caches. The MPC604, on the other hand, contains dual 16-KB caches, while the MPC620 will have separate 32-KB instruction and data caches.

The alliance had originally intended to deliver on its performance promises by now. Although Motorola contends that its MPC604 chip outperforms the Pentium, the estimated 15-30 percent improvement in performance still falls short of the 2x increase originally promised. Moreover, Intel’s upcoming P6 microprocessor is expected to show performance comparable to that of the MPC604 chip.

IBM finally began shipping its PowerPC-based machines in June of this year. However, the entry-level models running Windows NT cost consumers some $3700, well above the $2500 for similar Intel-based machines running the same operating system. Further, IBM announced nearly a year ago that it was delaying the launch of its PowerPC computers so that it could make ready its OS/2 for PowerPC. Apparently, the port of the operating system is taking longer than expected, and the company could wait no longer to deliver its Power Series computers.

Meanwhile, the first Linux kernel for PowerPC is up and running on Motorola's PowerPC VME 1603. But when asked about a similar port to PowerMac, project coordinator Joseph Brothers indicates that Apple cannot come up with the necessary programming specifications for the PowerMac's NuBus, nor can it provide necessary information on devices, memory maps, or interrupt hardware. Motorola has tried for more than a year to obtain the necessary specifications from Apple. (Incidentally, the Linux kernel is available via anonymous ftp from liber.stanford.edu/pub/linuxppc.)

Despite the short-term glitches in cost, performance, and roll-outs, the PowerPC is impressive, and most major operating systems will likely run on PowerPC-based platforms in the future.

Still, at this stage, the fate of the PowerPC alliance is in the hands of developers such as yourself. May the source be with you.


Dr Dobbs 
Make your Macintosh an Internet server and a UNIX workstation.

Mach Ten is a Berkeley 
BSD UNIX that runs on 
the Classic to the Power 
Macintosh, including 
PowerBooks and Duos! 

So in addition to all of the easy-to-use applications that make Macintosh one of the most personable computers around, you get a Mach-based UNIX with pre-emptive multitasking.

Mach Ten’s strength lies in the way it extends the Macintosh Operating System with UNIX networking and software development tools. The Macintosh/UNIX integration is so strong that you can even use Mac programs & utilities on UNIX data, and UNIX programs & utilities on Mac files.

And Mach Ten’s full internet protocol support lets you use your Mac as a domain name server, IP router, POP mail server, or Web server.


Any NFS server can be used to store Macintosh files. Users can access them by double clicking as on the local disk.


The UNIX software development 
system includes the GNU C and 
C++ compilers and libraries to let 
you create new applications or port 
existing ones. The Motif toolkit and 
suite of X clients and X client 
libraries make developing distributed 
applications a breeze. 


Files and directories can be viewed on disk 
using the Macintosh Finder or the more versatile 
UNIX commands. 

And Tenon’s high performance 
X Server lets you use your Macintosh 
or Power Macintosh as an X terminal. 

Join the many satisfied users of 
proven, reliable Mach Ten UNIX, and 
start turning all of your Macs into 
open systems today! 

For more information, or to order 

Call 1-800-6-MACH-10. 


Internet: info@tenon.com
http://www.tenon.com

Access network resources using the power of TCP/IP. Telnet, ftp, rlogin, or Xterm connections are possible using Mach Ten.


TENON INTERSYSTEMS


New Dimensions in Personal 
Workstation Technology 


Tenon Intersystems 1123 Chapala Street Santa Barbara, CA 93101 Tel: 805-963-6983 Fax: 805-962-8202 

©1906 Tenon Intersystems. The Tenon Intersystems name and Mach Ten are trademarks of Tenon Intersystems. Macintosh, Classic, PowerBook, Power Macintosh and Duo are registered trademarks of Apple Computer, Inc. UNIX is a registered trademark in the United States and other countries, licensed through X/Open Company Limited.

It's the highly-regarded industry standard, used by more developers than any other Macintosh development system. And now it's been totally re-engineered for Power Mac.

Introducing Symantec C++™ 8.0 for native Power Mac.

Not since the original THINK C™ has anything so dramatically boosted your productivity. There's a Visual Architect™ to generate GUI code instantly. An advanced Project Manager to handle the largest and most complex jobs. Plus native Power Mac tools for radically improved performance.

Boost your productivity with our Visual Architect, Advanced Editor, Debugger and Project Manager.


The Industry Standard Just Moved To A Higher Power.
Symantec C++ 8.0 for Power Mac.

Draw on the 
Industry Standard. 

With Symantec C++ 8.0, you simply draw the user interface - including windows, dialogs, controls, icons and menus. Then the built-in Visual Architect generates the code with the click of a mouse. Now you can spend more time on what really sets your application apart - its functionality.

A Higher Standard 
For Speed. 

The new high-performance compiler is dramatically faster than the previous version, so you can become more productive than ever.

And for even more power, we've added an Advanced Project Manager. It gives you drag and drop so you can easily add files, Named Option Sets for changing complex sets of options fast, plus support for even the largest applications.

There's also a new editor and a browser for modifying and navigating source files, a new debugger, an incremental linker, THINK Class Library™ 2.0 and much more.

New! Fast C and C++ Compilers for both Mac and Power Mac
New! Visual Architect for the Power Mac
New! Think Class Library 2.0 for Power Mac
New! Popup Menus take you right to declarations & headers
New! Debugger for nested projects and shared libraries
New! Project Manager for multiple targets and hierarchical support
New! Split Pane Editor for syntax highlighting and formatting.


SYMANTEC.


Price good in U.S. only. For more information in Canada, call L-&0Q-&67-36G 1, ext. 5513. In Australia, call 2-879-6577. In Europe, call 31-71-353111. Symantec C++, Visual Architect, THINK Class Library are trademarks of Symantec Corporation. All other trademarks are the property of their respective holders. © 1995 Symantec Corporation. All rights reserved.

A Double Standard For 68K And Power Mac.

Symantec C++ 8.0 is the first native Mac/Power Mac development system to support C++ templates, nested classes and multiple inheritance, as well as ANSI C. Plus 8.0 includes version 7.0 for 68K support.

This double standard gives you everything you need for both 68K and Power Mac development.


order at a special Upgrade price of $149, call 1-800^28^4777 Ext. 9A25.

Be sure to ask about the Symantec Developers Advantage Program for premium support and regular updates. Or visit your local C++ store.
Porting to the PowerMac


Apple's new generation of Macs is based on the Motorola PowerPC RISC processor. The PowerMac offers extremely high performance for applications that are compiled and linked for it. However, to preserve the investment users may have in existing software, the PowerMac supports legacy 68K applications. This support is accomplished through software emulation of the 68K instruction set and operating-system support for the 68K run-time model. In addition, new and old code (as well as run-time architectures) can be mixed within an application. Apple developed the PowerMac OS this way; some of System 7.5 is still 68K code.

In this article, I'll describe the main similarities and differences between the old and new OSs and the process of porting Macintosh 68K applications to the PowerMac. I'll also present an application that illustrates using code resources to mix new and old code within an application.

68K versus PowerMac 

On a PowerMac, two operating systems coexist in parallel: the original 68K system and the new PowerMac system. They run on top of a “nanokernel,” which provides the lowest-level services such as memory management and interrupt handling. The magic of coexisting 68K and PowerPC software is worked by the Mixed Mode Manager.


Paul is a staff engineer with Symantec's Development Tools Group and works on Macintosh and Windows development tools. He can be contacted at pkaplan@symantec.com.


A tale of two 
operating systems 


Paul Kaplan 


When an application is launched, the PowerMac OS looks for the special Code Fragment Resource, type cfrg, which specifies a PowerMac application. If a valid cfrg resource exists, the application is handed to the Code Fragment Manager (CFM). This subsystem manages the loading and execution of applications and shared libraries. In addition to handling the default load format, the CFM allows the use of custom loaders. A 68K application has no cfrg resource and is therefore handed to the 68K Segment Manager.
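The launch decision described above can be pictured with a small portable model. This is only a sketch of the decision logic, not Toolbox code: the type and function names are hypothetical, and sending an application with an invalid cfrg resource to the Segment Manager is an assumption, since the text only covers the "valid cfrg" and "no cfrg" cases.

```c
/* Hypothetical model of the launch-time choice between the two loaders. */
typedef enum { LOADER_CFM, LOADER_SEGMENT_MANAGER } Loader;

typedef struct {
    int has_cfrg;    /* nonzero if the file carries a 'cfrg' resource */
    int cfrg_valid;  /* nonzero if that resource is well formed */
} AppFileModel;

/* A valid cfrg resource marks a PowerMac application and selects CFM;
   anything else is treated as a 68K application (assumption for the
   invalid-cfrg case). */
Loader choose_loader(const AppFileModel *app)
{
    if (app->has_cfrg && app->cfrg_valid)
        return LOADER_CFM;
    return LOADER_SEGMENT_MANAGER;
}
```

In this model a classic 68K application is simply a file for which has_cfrg is zero.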

After an application has been launched as either 68K or PowerMac, it can switch modes while running. To switch modes and run unmodified, 68K applications call the Mixed Mode Manager implicitly; PowerMac applications can call it implicitly or explicitly.

In order to run 68K applications, the PowerMac OS has retained a number of components of the 68K OS. In fact, the PowerMac toolbox calls are a superset of the 68K system. The file system is the same, so “well-behaved” applications can be ported with little more than recompiling and linking; the development system will take care of run-time details. The System 7 MacOS, on the other hand, retains a single address space for all running applications, and the multitasking model is still cooperative and non-preemptive. Future releases of the MacOS, beginning with System 8 (code named “Copland”), will provide multiple virtual address spaces, preemptive multitasking, memory-mapped I/O, and object-oriented user-interface components.

The run-time model for PowerMac applications is completely new. Now, only one code and one data segment are required, and the segment manager is no longer used. The code segment has no relocations, which makes it sharable, and all the relocations are in the data segment. Each application has a Table Of Contents (TOC) that serves the same function as the 68K “A5 world” and greatly simplifies access to global data. The TOC is created by your development system and is transparent to C or C++ code. Also, the new OS supports, and depends heavily on, shared libraries. In fact, the PowerMac toolbox is a shared library. Finally, the application file format has been completely reorganized.

Porting Your Application 

Porting to the PowerPC can be as simple as recompiling if your source code meets the requirements listed in the next few paragraphs. For example, Symantec C++ 8.0 automatically converts your existing 7.0 (68K) project; recompile and link it, and it's ready to go. On the other hand, legacy applications that take shortcuts to system features will need some porting work.

The first step in porting any application is to ensure that your code runs under 68K System 7. Such an application should use only “32-bit clean” addresses. Older Mac applications sometimes used the high byte of an address for purposes other than the address. PowerPC addresses use all 32 bits. In compiling your code, use ANSI C or C++, which will force stronger type checking and function prototypes. Also, compile with Apple’s Universal Headers, which are shipped with your development system. Universal Headers are appropriate for both 68K and PowerPC applications and will make your code portable between them. In addition, either rewrite inline assembly in C, or place the inline code in separate assembler files. If you insist on keeping the 68K code, it should be isolated in a separate code resource.
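To make the “32-bit clean” requirement concrete, here is a minimal portable sketch. It is not Toolbox code: Addr32, TaggedAddr, and the function names are hypothetical. It models the old habit of hiding a flag byte in the top 8 bits of an address, shows which addresses that habit corrupts, and contrasts it with keeping the flags in a field of their own.

```c
#include <stdint.h>

/* Hypothetical model of a 32-bit address value; not a Toolbox type. */
typedef uint32_t Addr32;

#define HIGH_BYTE_MASK 0xFF000000u

/* Old habit: overwrite the top 8 bits of the address with a flag byte. */
Addr32 pack_flags_in_address(Addr32 addr, uint8_t flags)
{
    return (addr & ~HIGH_BYTE_MASK) | ((Addr32)flags << 24);
}

/* The trick only preserves an address whose top byte was already zero. */
int survives_high_byte_trick(Addr32 addr)
{
    return pack_flags_in_address(addr, 0) == addr;
}

/* 32-bit clean alternative: all 32 bits stay with the address and the
   flags live beside it. */
typedef struct {
    Addr32  addr;
    uint8_t flags;
} TaggedAddr;

TaggedAddr make_tagged(Addr32 addr, uint8_t flags)
{
    TaggedAddr t;
    t.addr = addr;
    t.flags = flags;
    return t;
}
```

An address such as 0x80123456 does not survive the packing trick, because its top byte is part of the address; the separate-field version stores it unchanged.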

Don’t make assumptions about registers, especially passing parameters, as they are all different. And try to use data types with 4-byte alignment. Although the PowerPC processor allows alignment anywhere, 4-byte alignment produces more-efficient code. However, if you're writing structures to a file, using 4-byte alignment can waste disk space.

Beyond these steps, you can use #pragmas to force 68K alignment where it is necessary for toolbox routines. Check that the alignment is correct when reading data from an existing disk file. Also, use int and long data types. On the PowerPC, int and long are 32 bits, and short is 16 bits. The 32-bit integer is the most efficient data type.
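The effect of alignment on structure layout can be seen in a short portable sketch. The text does not name the pragma its compiler uses, so `#pragma pack`, which GCC, Clang, and MSVC all accept, stands in here as an assumption; the record types are hypothetical. Fixed-width types are used because int and long are not 32 bits on every modern compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Natural alignment: the compiler pads after `kind` so that the 32-bit
   field starts on a 4-byte boundary. */
typedef struct {
    int16_t kind;
    int32_t value;
} NaturalRecord;

/* 2-byte packing, approximating a 68K-style layout such as an existing
   disk file might use. `#pragma pack` is a stand-in for whatever
   alignment pragma your compiler provides. */
#pragma pack(push, 2)
typedef struct {
    int16_t kind;
    int32_t value;
} PackedRecord;
#pragma pack(pop)

/* Byte offset of `value` in each layout. */
size_t natural_value_offset(void) { return offsetof(NaturalRecord, value); }
size_t packed_value_offset(void)  { return offsetof(PackedRecord, value); }
```

With typical compilers the natural layout is larger than the packed one, which is the disk-space cost mentioned above, and a file written with one layout cannot be read correctly with the other.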

Use the double data type for floating-point variables. The PowerPC FPU supports only the IEEE 4-byte (float) and 8-byte (double) floating-point formats. Double is more efficient. The 10- and 12-byte doubles used on 68K are not supported by the processor. Long doubles are supported with two doubles. (Note that the Symantec compiler does not support long doubles.) Check all #pragmas and dependencies on #defines to ensure they still have meaning in the new environment. Do not put data in code. This would affect pipelined-instruction performance. And if you have Pascal code, convert it to C either by hand or with the MPW p2c Pascal-to-C converter (available on Apple’s ETO #17 CD-ROM).

When porting the system interface portion of your application, you should generally use system calls instead of accessing the hardware directly. In addition, convert callbacks to universal procedure pointers. These are available in the Universal Headers. If you’re passing a callback procedure's address to the operating system, you must create a UniversalProcPtr with the NewRoutineDescriptor function (the actual data structure that describes the function is called a “routine descriptor”). You need to use UniversalProcPtrs because the OS makes no assumption about the callback's architecture. Strictly speaking, routine descriptors are not required for 68K builds (they are compiled into addresses), but using them will make your code completely portable between the two environments.
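The idea behind a routine descriptor can be modeled in portable C. This is a conceptual sketch only: the types and functions below are hypothetical stand-ins, not the declarations in Apple's Universal Headers. It shows a caller that never assumes the callback's architecture, but instead inspects a descriptor and notes a (simulated) mode switch when the architectures differ.

```c
#include <stdint.h>

/* Hypothetical stand-ins; the real types come from the Universal Headers. */
typedef enum { ISA_68K, ISA_POWERPC } IsaKind;

typedef int32_t (*CallbackFn)(int32_t);

typedef struct {
    IsaKind    isa;      /* architecture the routine was built for */
    CallbackFn routine;  /* the code to run */
} RoutineDescriptorModel;

/* Counts simulated mode switches, standing in for the cost of crossing
   between 68K and PowerPC code. */
int mode_switches = 0;

RoutineDescriptorModel new_descriptor(CallbackFn fn, IsaKind isa)
{
    RoutineDescriptorModel d;
    d.isa = isa;
    d.routine = fn;
    return d;
}

/* Call through the descriptor, recording a mode switch when the caller
   and the callback use different architectures. */
int32_t call_through(const RoutineDescriptorModel *d, IsaKind caller_isa,
                     int32_t arg)
{
    if (d->isa != caller_isa)
        mode_switches++;
    return d->routine(arg);
}

/* Example callback used only for illustration. */
int32_t double_it(int32_t x) { return 2 * x; }
```

The same call site works whether the callback is tagged 68K or PowerPC, which is the portability benefit of passing descriptors rather than raw addresses.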

Another thing to watch for is direct access of low memory. Don't do it! Rather, use the LMSetxxx and LMGetxxx calls in LowMem.h. Finally, don’t explicitly use the 68K run-time model. The 68K run-time-specific calls are not supported. For example, a call to the Segment Manager would return with no action.

Linking Your PowerMac Application

Your linker will create a “fragment,” which is the atomic load unit and contains code and static data. Fragments are managed by the CFM. Most PowerMac applications and shared libraries use the Preferred Executable Format (PEF) to house fragments. PEF specifies the file header, segments for code and data, import- and export-symbol tables, and relocations. Normally, the application resides in the data fork of its file, although fragments can be resources as well. The linker in your development system will handle the details of fragments and the PEF.

Your linker should support the xcoff format, which is an extension of the coff format found in UNIX. This is important because the only stub libraries Apple supplies to link to the toolbox and shared-library extensions such as the Drag Manager are in xcoff format. The stubs are supplied with your development system. The xcoff format can also be used to link third-party static libraries and object modules from a single translation unit. Normally, the Symantec development environment skips the step of writing object files; the compiler passes them directly to the linker in memory.


Dividing applications into shared libraries will make your code reusable and smaller by eliminating redundantly loaded code. Your development environment will help you create and manage shared libraries.

Under the Application Hood 

As mentioned, the run-time model of an application running on the PowerMac OS is quite different from that of the 68K. The PowerMac run-time model has one code and one data segment, which are normally loaded in memory. The code segment is read only, which makes it suitable to run in ROM, but unsuitable to store writable data. Code and data elements may be exported from the fragment, which means their symbols are made public and may be linked dynamically. With the Symantec environment, symbols are exported with a #pragma.

Within the data segment resides the TOC, which is like a personal address book. It provides linkage to symbols inside and outside the fragment. The TOC has linkage to imported routines, imported data, global variables, and the pool (or pools) of static variables. When loading the application and its shared libraries, CFM resolves imported symbols and fills in the appropriate TOC entries. The TOC is 64 KB, so there is a maximum of 16K TOC entries. Your development system will warn you of a TOC overflow.
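The TOC arithmetic and the indirection it provides can be sketched in portable C. This is a toy model with hypothetical names, not the real run-time structure. The 4-byte entry size is an assumption inferred from the figures above (64 KB divided by 16K entries), and the table is deliberately tiny.

```c
#include <stddef.h>
#include <stdint.h>

#define TOC_BYTES       (64u * 1024u)  /* size stated in the text */
#define TOC_ENTRY_BYTES 4u             /* assumption: one 32-bit pointer per entry */
#define TOC_MAX_ENTRIES (TOC_BYTES / TOC_ENTRY_BYTES)

/* Toy table of addresses that a loader would fill in at load time. */
typedef struct {
    const void *entries[8];  /* tiny capacity, for illustration only */
    size_t      count;
} TocModel;

/* Record an address in the next free slot; returns the slot index, or -1
   on overflow (the condition a development system would warn about). */
int toc_add(TocModel *toc, const void *address)
{
    if (toc->count >= sizeof toc->entries / sizeof toc->entries[0])
        return -1;
    toc->entries[toc->count] = address;
    return (int)toc->count++;
}

/* Code reaches a global indirectly, through its TOC slot. */
int32_t toc_read_int(const TocModel *toc, int slot)
{
    return *(const int32_t *)toc->entries[slot];
}
```

Because code only holds slot numbers, the code segment itself needs no relocations; only the table in the data segment is patched at load time.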

Applications should have a main() entry point and may additionally have user-initialization and termination routines. CFM will call the main() entry point of an application after it is loaded. CFM may also call an initialization routine as part of loading the fragment, and it may call a termination routine when it unloads the fragment. Your development system will help you define these entry points.

Shared-Library Details 

Although common in UNIX, shared libraries are probably best known as DLLs in Microsoft Windows. Originally, shared libraries were available as an add-on to older MacOS versions with Apple Shared Library Manager (ASLM), but they are now a standard feature and are in common use on the PowerMac. Shared libraries are similar to applications. The main differences are that the file type for a shared library is shlb, not APPL, and that there is no main() entry point. Initialization and termination routines are allowed.

When the PowerMac system starts up, its shared libraries are registered with CFM and made available to all calling applications. Other shared libraries can be loaded and called at application startup if specified in the PEF file, or loaded on request by the application. Shared libraries can be loaded automatically by specifying them as import libraries to your development environment. Your linker will resolve external symbols to a shared library as though they were part of a statically linked library. However, the linker knows they've been imported and will put them in the import list for the appropriate library. As CFM loads your application, it will also attempt to load shared libraries specified by the application.

Shared libraries can also be loaded explicitly with the toolbox call GetDiskFragment. In this case, imports should be specified as “weak” so the linker won't be unhappy with the unresolved references. If you load a shared library explicitly, your code should be able to handle a failed load or an unresolved import (which will have a null address at run time). Shared libraries also have version capabilities. CFM checks the version number of a shared library against the version number required by the application and fails on load if it is not compatible. Version numbers can be specified by your development system.
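Handling an unresolved weak import comes down to checking for a null address before calling. The following portable sketch models that check; WeakImport and the function names are hypothetical, and no Toolbox loader call is used.

```c
#include <stddef.h>

typedef int (*OptionalFn)(int);

/* Hypothetical model of an import slot that the loader fills in. The
   pointer stays NULL when a weak import could not be resolved. */
typedef struct {
    OptionalFn fn;
} WeakImport;

/* Use the imported routine if it was resolved; otherwise return the
   caller's fallback instead of jumping through a null address. */
int call_or_fallback(const WeakImport *imp, int arg, int fallback)
{
    if (imp == NULL || imp->fn == NULL)
        return fallback;
    return imp->fn(arg);
}

/* Example routine standing in for a function exported by a library. */
int add_one(int x) { return x + 1; }
```

A program built this way can keep running with reduced functionality when an optional library is missing.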

Code Resource Examples

MacOS System 7 code resources such as CDEF and MDEF do not need to be immediately ported to PowerPC. However, there are performance penalties for mixed-mode switching and for running 68K-emulated code. If the performance of a code resource is critical, you should convert that resource to a native, or “accelerated,” resource.

To illustrate the process of gradually porting to the PowerMac, I’ve included a sample project and the required modifications. Listings One and Two show the project source files from a 68K program that calls a 68K resource; listings begin on page 12. This project is a simple application that creates a window, has a standard event loop, and calls the main() routine in the code resource to handle the Update Event. The examples don't use any C++ features, although they were compiled with the Symantec C++ compiler. The InitToolboxStuff() and MouseDownProc() routines are standard Mac idioms and aren't shown. Also, the error checking that would be in commercial-grade code is omitted.

The first modification is the same project ported to the PowerMac. Note that the code-resource routine is still 68K and therefore unchanged. The main-project routine (see Listing Three) requires a few changes to call the code resource through the Mixed Mode Manager. Note the use of the Toolbox routine CallUniversalProc(), which has a varargs parameter list, and the two required parameters, ProcInfoType and UniversalProcPtr. ProcInfoType has been initialized to describe the interface of the routine so that CallUniversalProc() will use the parameters correctly.

The second modification illustrates the changes required to port the resource to the PowerMac. This time, the main project routine has not changed because it was ported in the first modification. Listing Four illustrates the accelerated resource code. There are new calls to __cplusrsrcinit() and __cplusterm(); the calls to RememberA0(), SetUpA4(), and RestoreA4() have been deleted.

Normal, nonresource applications always follow the main(argc, argv) convention. The standard run-time library contains hidden code to set up any arguments to main(), and initialize static constructors and destructors. Code resources, by tradition, do not necessarily conform to an entry-point standard.

The Symantec solution for code resources in C++ requires explicit calls to the run-time routines __cplusrsrcinit() and __cplusterm() within the main() routine of the code resource. The run-time routines call any static constructors and destructors, and make the QuickDraw globals available to the code resource. Code resources also require routine descriptors, which play a similar role to the ProcInfoType parameter used in CallUniversalProc.

Another feature of the PowerMac is support for a “fat application”: a single Mac app that contains a 68K version in the resource fork and a PowerMac version in the data fork. Many of the resources, such as menus and icons, can be directly shared. With a little work, code resources can be shared as well. A fat application is backward compatible with 68K, System 7 machines. Although they take up more disk space, fat applications neatly solve the packaging problem for some vendors.

Conclusion 

Porting standard applications from 68K to PowerPC is relatively simple. The tools you have to work with (the Mixed Mode Manager, CFM, and your development system) will allow you to gradually port your application, develop an application that will run on both the PowerMac and 68K systems, and create an application exclusively for the PowerMac.
Listing One

// Macintosh application to create a very simple window and do basic
// event handling. Paul Kaplan - Symantec Corporation

#include "InitToolboxStuff.h"
#include "MouseDownProc.h"
#include "UpdateWinProc.h"
#define WIN_RESID  128
#define CODE_RESID 120

void main()
{
    static WindowPtr   theWindow, foundWindow;
    static EventRecord theEvent;
    Handle UpdateWinProcHandle;

    InitToolboxStuff();

    // Setup Window and mouse tracking region
    theWindow = GetNewWindow(WIN_RESID, nil, (GrafPtr)-1);
    RgnHandle mouseRgn = NewRgn();

    // Get code resource and lock its handle
    UpdateWinProcHandle = GetResource('CODE', CODE_RESID);
    HLock(UpdateWinProcHandle);

    Boolean more2do = TRUE;
    while (more2do)    // Standard event loop processing
    {
        if (WaitNextEvent(everyEvent, &theEvent, 0xffffffff, mouseRgn))
        {
            switch (theEvent.what)
            {
                case updateEvt:    // Call the code resource
                    (*(UpdateWinProcPtr)(*UpdateWinProcHandle))(theWindow);
                    break;
                case mouseDown:    // Standard Mac Toolbox handling of Mouse Down
                    more2do = MouseDownProc(&theEvent, &foundWindow);
                    break;
                default:
                    break;
            }
        }
    }

    // Free all allocated memory
    HUnlock(UpdateWinProcHandle);
    ReleaseResource(UpdateWinProcHandle);
    DisposeRgn(mouseRgn);
    DisposeWindow(theWindow);
}


Listing Two

// Code resource procedure to draw text in a window

#include <SetUpA4.h>

#define HORIZ 65
#define VERT 95

void main(WindowPtr myWin)
{
    static char msg[] = "68K Code Resource";
    GrafPtr savedPort;

    RememberA0();                   // Save value of A0 for next macro
    SetUpA4();                      // Set up A4 for resource globals
    GetPort(&savedPort);            // Save current GrafPort
    SetPort(myWin);                 // Make mine the current GrafPort
    BeginUpdate(myWin);
    MoveTo(HORIZ, VERT);            // Move cursor to position
    DrawText(msg, 0, sizeof(msg));  // Draw the string
    EndUpdate(myWin);
    SetPort(savedPort);             // Restore current GrafPort
    RestoreA4();                    // Restore A4
}


Listing Three

// Macintosh application to create a simple window and do basic event handling.
// Paul Kaplan - Symantec Corporation

#include "InitToolboxStuff.h"
#include "MouseDownProc.h"
#include "UpdateWinProc.h"
#define WIN_RESID 128
#define CODE_RESID 128

void main()
{
    static WindowPtr theWindow, foundWindow;
    static EventRecord theEvent;
    Handle UpdateWinProcHandle;

    // Variable to hold Universal Proc Pointer
    UniversalProcPtr theUPP;

    // Proc Info Type - describes the called procedure's interface
    ProcInfoType theProcInfo = kCStackBased |
                               STACK_ROUTINE_PARAMETER(1, kFourByteCode);

    InitToolboxStuff();

    // Setup Window and mouse tracking region
    theWindow = GetNewWindow(WIN_RESID, nil, (GrafPtr)-1);
    RgnHandle mouseRgn = NewRgn();

    // Get code resource and lock its handle
    UpdateWinProcHandle = GetResource('CODE', CODE_RESID);
    HLock(UpdateWinProcHandle);

    Boolean more2do = TRUE;

    while (more2do) // Standard event loop processing
    {
        if (WaitNextEvent(everyEvent, &theEvent, 0xffffffff, mouseRgn))
        {
            switch(theEvent.what)
            {
                case updateEvt: // Call the code resource using
                                // CallUniversalProc instead of
                                // calling routine directly
                    theUPP = (UniversalProcPtr)*UpdateWinProcHandle;
                                // Convert dereferenced handle to UPP
                    CallUniversalProc(theUPP, theProcInfo, theWindow);
                                // Call Mixed Mode Manager
                    break;
                case mouseDown: // Standard Mac Toolbox handling of Mouse Down
                    more2do = MouseDownProc(&theEvent, &foundWindow);
                default:
                    break;
            }
        }
    }

    // Free all allocated memory
    HUnlock(UpdateWinProcHandle);
    ReleaseResource(UpdateWinProcHandle);
    DisposeRgn(mouseRgn);
    DisposeWindow(theWindow);
}


Listing Four

// Code resource procedure to draw text in a window
#include <new.h>

#define HORIZ 65
#define VERT 95

void main(WindowPtr myWin)
{
    static char msg[] = "PPC Code Resource";
    static GrafPtr savedPort;

    // Call any static constructors in this link unit. Also make
    // QDGlobals available
    __cplusrsrcinit();

    GetPort(&savedPort);            // Save current GrafPort
    SetPort(myWin);                 // Make mine the current GrafPort

    BeginUpdate(myWin);
    MoveTo(HORIZ, VERT);            // Move cursor to position
    DrawText(msg, 0, sizeof(msg));  // Draw the string
    EndUpdate(myWin);

    SetPort(savedPort);             // Restore current GrafPort
    __cplusrsrcterm();              // Call any destructors in this link unit
}


End Listings
Optimizing for the PowerPC

Strategies for greater performance

Michael Ross

With its great speed, low power requirements, and flexible programming model, the PowerPC represents a jump in microprocessor technology. Consequently, many developers are beginning to port their applications from the Intel architecture to the PowerPC. The PowerPC and the Pentium, however, represent two very different approaches to computer architecture, and moving applications from one platform to the other with a minimum of fuss is not always easy. You may need to change development platforms, targets, or tool vendors.

At MetaWare, we've ported our C/C++ compilers to the PowerPC. In doing so, we've learned a few tricks that I'll share with you in this article. I'll also describe some of the techniques we use to improve the code our compilers generate for the PowerPC. For the basis of my discussion I'll use the 601 model.

Michael, a software engineer for MetaWare, can be contacted at miker@metaware.com.

PowerPC Quick Tour

One interesting feature of the PowerPC is the branch processing unit (BPU); see Figure 1. The branch processor doesn't depend on either the integer or floating-point units; it works in concert with the instruction unit to keep instructions flowing. The BPU can look ahead in the instruction queue for a branch instruction and use static branch prediction on unresolved conditional branches to permit fetching instructions from the predicted target instruction stream. When prediction is correct, a branch can be performed in zero clock cycles. This feature of the PowerPC architecture is similar to the branch table of the Pentium or the branch target buffer and the reorder buffer of Intel's upcoming processor, the P6. The Pentium's 256-element Branch Table for dynamic branch prediction does not boast the same success rate as the PowerPC's BPU, and causes a 3- to 20-cycle penalty if the prediction fails. This is also true on the P6. The PowerPC BPU has more built-in capability that doesn't rely on the integer-processing unit. Mispredicted branches, on average, incur less penalty than in the P6 and Pentium. This is because the instruction queue is only eight instructions long, and a flush of the queue is likely to incur only a 1- or 2-cycle penalty.

The BPU has three special-purpose registers that are not part of the usual general-purpose registers:

• Link register (LR).
• Count register (CTR).
• Condition register (CR).

The BPU calculates and saves the return pointer for subroutine calls in the LR. The CTR contains the target address for some conditional branch instructions. The LR and CTR can be easily copied to or from any general-purpose integer register. Because the BPU has these special-purpose registers, all branching except for synchronization can be carried out independent of the integer and floating-point units. Unlike the Pentium, which has many special conditions that must be fulfilled before instructions can execute in parallel (often producing stalls), the PowerPC's BPU helps prevent stalls from occurring. Since the PowerPC architecture reserves these registers (LR, CTR, CR) for the branch processing unit, compilers have more integer registers available for allocation of important variables. From the compiler's point of view, the number of fast, general-purpose registers available on a processor is a key factor in the execution speed of applications. In this respect, the PowerPC has the Pentium beat. The Pentium has only eight 32-bit integer registers: ESP and EBP are dedicated to special purposes, and other registers are implicitly destroyed by certain instructions, creating a headache for register allocation. The P6 processor (with 40 general registers and a hardware register-renaming scheme) helps, but it doesn't solve the problem, due to the need for backward compatibility. Under the PowerPC application binary interface (ABI), the compiler has 15 general-integer registers that may be used just for local variables. On the Pentium, some optimizations such as strength reduction for array accesses simply cause too much competition for registers, and rarely pay off. The PowerPC allows you to take advantage of all the possible optimizations at your disposal.
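
To make the register trade-off concrete, here is a minimal sketch of strength reduction on an array access. The function names (sum_stride_indexed, sum_stride_reduced) are hypothetical and the code is hand-written for illustration; it is not the output of any compiler discussed here. The reduced form replaces the per-iteration multiply with a running index, at the cost of one more live variable, which is the kind of extra register demand the paragraph above describes.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward form: every iteration multiplies i by stride. */
long sum_stride_indexed(const int *a, size_t n, size_t stride)
{
    long sum = 0;
    assert(a != 0 && stride > 0);
    for (size_t i = 0; i < n; i++)
        sum += a[i * stride];
    return sum;
}

/* Strength-reduced form: the multiply becomes a running index that is
   advanced by stride, so idx is one more value that must stay live. */
long sum_stride_reduced(const int *a, size_t n, size_t stride)
{
    long sum = 0;
    size_t idx = 0;
    assert(a != 0 && stride > 0);
    for (size_t i = 0; i < n; i++) {
        sum += a[idx];
        idx += stride;
    }
    return sum;
}
```

Both functions read the same n elements and return the same sum; only the address arithmetic inside the loop differs.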

The instruction unit actually contains the instruction queue and the BPU, providing central control over the instructions issued to the execution units, the integer unit, and the floating-point unit. The instruction unit determines the next instruction to be fetched from the 32-KB cache and controls pipeline interlocks. It allows out-of-order execution when instructions do not depend on the result of an instruction currently executing.

The PowerPC's integer unit (IU) performs all loads, stores, and integer arithmetic. It contains an ALU and an integer exception register (XER). Most integer instructions execute in one clock cycle. Loads and stores are issued and translated in sequential order, but actual memory access can occur out of order. Synchronizing instructions are available for times when order is critical. The IU contains 32 integer registers, each 32 bits wide. 64-bit versions of the PowerPC are coming soon.

The floating-point unit (FPU) is fully IEEE 754 compliant and contains 32 floating-point registers, each of which can hold either a single- or double-precision operand. The FPU can look ahead, find instructions in the queue that do not depend on unexecuted instructions, and process them early. The FPU is pipelined, so that most instructions can be issued back to back, without stalling. Unlike the stack-style access to floating operands of the Intel chips, the floating-point registers allow true random access, so complex scheduling and swapping algorithms are not necessary to achieve good performance.

Measuring Performance 

The difference in philosophies between the PowerPC and the Pentium becomes more evident as you begin to analyze code and performance. The Pentium chip places the burden of good performance squarely on the compiler writer's shoulders. The compiler writer has to be aware of all the special conditions that allow parallel execution of integer instructions in the U and V pipes on the Pentium, and the many conditions where instruction pairing isn't possible. The PowerPC lacks all these constraints and provides a lot of hardware assistance to make the job easier. Compounding the performance problem for application vendors is the fact that the same sequence of instructions won't run universally well across the family of Intel processors. For example, to gain performance on the Pentium, you should replace integer multiply instructions with a sequence of less complex shifts and adds where possible. However, on other Intel processors such as the P6, you should do just the opposite: use the integer multiply instruction rather than adds and shifts. On the 80486, you need to be concerned with whether an instruction is aligned so that it crosses a 32-byte boundary, while on the Pentium this makes little difference.




It's much easier to get an application that has uniform performance across the PowerPC family. A spreadsheet vendor might have to compile for the lowest common denominator on the Intel family (probably the 80486), so in many cases, customers would not get to use the power of their Pentium processors. The main thing Pentium has going for it is a huge installed base, and a lot of shrink-wrapped software for Windows/NT, DOS, and OS/2.

Although no benchmark is a real indication of how well or poorly your application will run on a given platform, the following two programs are more than just benchmarks, because people really use them in their work.

The first, Espresso, is an almost exclusively integer benchmark. Several "hot spots" dominate its execution time, one of which is the routine massive_count in cofactor.c. This routine is mainly a large sequence of If-Then-Else constructs that should be a natural for branch-prediction and cross-jumping optimizations. The C source code for Espresso is shown in Listing One; listings begin on page 17. Notice that in the first two loops, the bulk of the operations are of the form: if (val & <constant>) cnt[<constant subscript>]++;. This form shows that each operation is independent of successive ones. A good compiler will keep val in a register, along with the base address of the array cnt, and the PowerPC will fill the integer pipeline so that instructions keep executing at a one-per-cycle rate without stalls. Because the value in each condition is a constant, the BPU should be able to easily predict the need to branch or fall through. The compiler should hoist the load of the base address of cnt out of each If-Then-Else construct.
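
The shape of these operations can be seen in a small standalone sketch. The function below is a hypothetical reduction of the loop body to a single byte of val; it is not the SPEC source, and the name count_low_byte is my own. Every test uses a constant mask and a constant subscript, so no test depends on the outcome of the one before it.

```c
#include <assert.h>

/* Add one to cnt[k] for every bit k (0..7) that is set in the low byte
   of val. cnt must point at (at least) eight counters. */
void count_low_byte(unsigned int val, int *cnt)
{
    assert(cnt != 0);
    if (val & 0x00FF) {             /* skip all eight tests if the byte is zero */
        if (val & 0x0080) cnt[7]++;
        if (val & 0x0040) cnt[6]++;
        if (val & 0x0020) cnt[5]++;
        if (val & 0x0010) cnt[4]++;
        if (val & 0x0008) cnt[3]++;
        if (val & 0x0004) cnt[2]++;
        if (val & 0x0002) cnt[1]++;
        if (val & 0x0001) cnt[0]++;
    }
}
```

Because val and the base address cnt never change inside the block, both can stay in registers for all eight tests, and the only memory traffic is the load and store of each counter that actually gets incremented.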

In assembly-language format, most PowerPC instructions use the rightmost two registers as operands and the leftmost as the destination. The code in Listing Two was generated by the MetaWare PowerPC C/C++ compiler for Solaris on PowerPC. The only surprise in Listing Two is that the test for the next If is scheduled ahead of the increment for the last one. The reason is simple: The 601 BPU is highly independent of the integer-execution unit, and the add and store instructions do not affect the condition codes in the BPU. Moving the test up avoids a stall due to a dependency on %r10 from one instruction to the next. The distance between branch instructions is small, allowing the BPU to look ahead in the queue and easily direct the instruction stream fetch to the next set of instructions for execution. Note that the only memory references are those that are absolutely necessary. On our 60-MHz PowerPC 601 with 32 MB of RAM, Espresso executes in 48 seconds (cumulative time for all of the SPEC data input sets from the input.ref directory). It has been estimated that one in five instructions is a branch, so the importance of the BPU in the architecture really shines here. Naturally, newer versions of the PowerPC, such as the 604, would be even faster.

Figure 1: PowerPC 601 block diagram.

For comparison, Listing Three is output from version 2.6d of the MetaWare C/C++ compiler for Unix V R4 386, on UnixWare 1.1. We compiled on a 60-MHz Pentium with the -586 switch for Pentium optimization. The ability to increment the array element in memory without loading it reduces the number of instructions needed to perform the loop. However, the real reason for this is the paucity of registers into which to load the array element. Pentium and P6 optimization lore would suggest that better instruction overlap might occur if this were more like the RISC load/store model. With the Pentium, unfortunately, this would generate an unacceptable number of spills. The P6, with its dual-integer instruction units and register renaming, may make this more feasible. Surprisingly, the Pentium is only a hair slower executing this code, managing to do all of the SPEC data sets in 50 seconds, a difference of 4 percent. The difference here appears to be in the time needed to make memory references, and the fact that the tests and branches are not independent of the integer unit. Note, however, that while this code would run the same or faster on all PowerPC variants, the same is not true of the 80x86 family, not because of clock speed, but because of differences in architecture. A 100-MHz 486 could not expect to do as well on this code, since each branch to the top of the loop would incur a 2-cycle penalty. This shows the value of branch prediction. The Pentium will suffer on integer code from its inability to pair shift and rotate instructions with variable shift counts, mul and div instructions, and some floating-point instructions. In spite of Pentium's parallel U and V pipes, some instructions don't execute in anything but the U pipe. This forces the compiler to schedule instructions so that two U pipe instructions don't occur consecutively.

Floating-Point Operations 

Though less common in mainstream applications, floating point is certainly important for scientific programmers. As an exercise in portability and because I'm the author of MetaWare's Fortran compiler, I decided to do the floating-point comparison for this article using Fortran. To do this, I had to port the Fortran front end and library to Solaris PowerPC. To my surprise, the entire process took under four hours, most of which was actual compiling and linking, and the result was a compiler that passed all but three of the Fortran 77 validation suite tests on the first try. (It's been passing entirely on other platforms for years.) Solaris made the process painless.

With the compiler ported, I was able to run the familiar LINPACK benchmark. LINPACK is showing its age, but it is still used in a wide variety of real applications. One routine, SAXPY, contains a loop that is the main time consumer in the benchmark. The PowerPC version of this loop is in Listing Four. Here you can see another thoughtful aspect of the PowerPC architecture. Not many RISC architectures include instructions like floating multiply and add (or subtract). The HP PA architecture has such an instruction, but its restrictions make it unusable for LINPACK. The Pentium, for all its CISCness, includes no such instruction. The compiler has reduced the addressing computations in this loop by doing strength reduction. And of course, there's no lack of general floating-point registers to work with.
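
For readers who want the loop itself in front of them, the core of SAXPY can be sketched in C as follows. This is an illustrative rendering with a signature of my own choosing, not the LINPACK Fortran source and not the compiler output discussed above. Each iteration is one multiply feeding one add, which is the pattern a floating multiply-and-add instruction can cover.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n): one multiply and one add
   per element. x and y are assumed not to overlap. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != 0 && y != 0);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Calling saxpy(3, 2.0f, x, y) with x = {1, 2, 3} and y = {10, 20, 30} leaves y holding {12, 24, 36}.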

Listing Five shows how the Pentium compares. In spite of best efforts and some good optimization techniques such as loop reversal and loop unrolling, Pentium doesn't come off too well here. The major problem again is lack of registers and the rather odd floating-point stack architecture. Pentium turns in a performance of 7.616 Mflops, compared to the PowerPC's 8.58 Mflops.

Optimization Techniques 

The compiler uses a number of techniques 
to make your code run more efficiently: 
common-subexpression elimination, dead- 
store elimination, register allocation, and 
global-constant propagation. The impact 
of these optimizations depends upon the 
architecture and the application code it¬ 
self. For example, Listing Six shows the 
assembly-language output on the Pentium 
using the SAXPY loop without any opti¬ 
mization turned on. 

Listing Six also shows the useless overhead in loading the address of the array element and performing the loop. The net effect is a loss in performance down to 5.317 Mflops.

The general optimizations pay off. If you compare the optimized and unoptimized code, you'll mainly see the effects of register allocation, loop unrolling, and induction elimination. But for fast code, each code generator needs to pay attention to the specific quirks of the target machine. For each architecture, the MetaWare compiler uses two phases, massage and expand, that work on the intermediate language form of the program. Here, the compiler must consider whether the machine has transcendental floating-point instructions and scaled-indexing addressing modes, and whether multiply or a series of shifts and adds is faster for that particular processor. On the PowerPC, for example, the compiler does not replace constant multiply with shifts and adds unless the final instruction sequence is no more than two instructions longer. The reason is that multiplies are fairly cheap on a PowerPC. On the Pentium, you look for things like block moves that might tie up too many registers and decrease the register pressure by combining base and index registers. On most of the architectures, these phases try to eliminate the use of a special frame-pointer register where possible and use the stack pointer instead, freeing up another general register for other purposes. With the Pentium, this is particularly important, since registers are so scarce.
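
The multiply-versus-shifts decision can be illustrated with a constant multiply by 10. The two functions below are a hand-written sketch with hypothetical names; they show the source-level equivalent of the rewrite, not what any particular compiler emits. Which form is faster depends on the target, as described above.

```c
#include <assert.h>
#include <limits.h>

/* x * 10 using the multiply instruction. */
int times10_mul(int x)
{
    assert(x >= 0 && x <= INT_MAX / 10);
    return x * 10;
}

/* x * 10 rewritten as x*8 + x*2: two shifts and one add. */
int times10_shift(int x)
{
    assert(x >= 0 && x <= INT_MAX / 10);    /* keeps both shifts free of overflow */
    return (x << 3) + (x << 1);
}
```

The rewritten form trades one multiply for three simpler instructions, so a code generator that counts instructions, as the PowerPC rule above does, would weigh that extra length against the cost of the multiply.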

A third phase for code improvement combines peephole optimization and scheduling. Each code generator has the option of both a high-level scheduling pass on the intermediate language and a low-level scheduling pass on the actual machine instructions. Because this is table driven, the code that actually does the scheduling can be the same in both cases. The tables take into account the vagaries of U and V pipe pairing on the Pentium and the pipeline stages of results on the PowerPC.

Programming for Performance 

As is evident from the PowerPC design, hardware dynamic-branch prediction is beginning to supplant the older branch-delay slot design, where an instruction was moved after a branch to execute while waiting for the branch to take effect. Most chips have a finite cache or instruction queue that they can examine in order to predict the branch. If you keep the distance between a conditional branch and its target small and use expressions that are easily dynamically evaluated by something like the PowerPC's BPU, you'll increase the chance that the BPU will effectively predict whether the branch will be taken. Also, if the target and the branch fall within the cache size, your code will likely execute faster. You should separate expressions and their consumers as widely as possible, within cache limits. For example, given the expressions in Figure 2(a), consider reordering the statements to Figure 2(b). This makes it possible to complete all the operations of the first statement while processing some of the independent operations of the second, thus avoiding any possible stalls waiting for the completion of the first statement.
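
As a hypothetical illustration of this kind of reordering (these are not the statements from Figure 2, and the function names are my own), the two C functions below compute the same results. In the first, each product is consumed by the very next statement; in the second, the independent products are started together and consumed afterward. An optimizing compiler may well perform this scheduling itself, so treat this as a sketch of the idea rather than a guaranteed speedup.

```c
#include <assert.h>

/* Original order: each product is consumed by the very next statement.
   in and out are assumed not to overlap. */
void dependent_order(const double in[4], double out[2])
{
    assert(in != 0 && out != 0);
    double t1 = in[0] * in[1];
    out[0] = t1 + 1.0;              /* must wait for t1 */
    double t2 = in[2] * in[3];
    out[1] = t2 + 1.0;              /* must wait for t2 */
}

/* Reordered: both independent products are started before either
   result is consumed. */
void separated_order(const double in[4], double out[2])
{
    assert(in != 0 && out != 0);
    double t1 = in[0] * in[1];
    double t2 = in[2] * in[3];      /* independent of t1, so it can overlap */
    out[0] = t1 + 1.0;
    out[1] = t2 + 1.0;
}
```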

Figure 2: Reordering statements to 
improve performance. 


For processors like the Pentium, look through the critical regions of your code after profiling your application. If your critical loops have more than four or five heavily used variables, try to reduce that number. Since the Pentium only has a few general registers available for local variables, you can give the compiler a boost in optimizing your code if you confine the number of heavily used variables or constants to a small number in loops. For floating-point code, think about ways to break down transcendentals or floating-point divide instructions into different expressions. For example, the MetaWare compiler tries to replace ARCSIN with ARCTAN2(X, SQRT((1-X)*(1+X))), which can be done with inline code.
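
The identity behind this substitution follows from the definition of the arcsine. The short derivation below uses my own notation (θ for the angle) and assumes that ARCTAN2(y, x) follows the usual two-argument arctangent convention of returning the angle of the point (x, y).

```latex
\begin{aligned}
\theta &= \arcsin x, \qquad -1 \le x \le 1,\quad -\tfrac{\pi}{2} \le \theta \le \tfrac{\pi}{2} \\
\sin\theta &= x, \qquad \cos\theta = \sqrt{1 - x^{2}} = \sqrt{(1 - x)(1 + x)} \ge 0 \\
\theta &= \operatorname{atan2}(\sin\theta,\ \cos\theta)
        = \operatorname{atan2}\!\bigl(x,\ \sqrt{(1 - x)(1 + x)}\bigr)
\end{aligned}
```

Writing the radicand as (1-X)*(1+X) rather than 1-X*X presumably limits cancellation error when the magnitude of X is close to 1.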

By breaking down complex operations into simpler ones, you'll give most processors the chance to schedule your code for faster execution. You can't always depend on your compiler to make transformations for you. Keep your floating-point expressions smaller than the size of the Pentium's floating-point stack. If you write a huge expression with more than seven or eight intermediate results, you are forcing the compiler to spill from the stack to some temporary location, or to use memory-reference instructions to complete the calculation, resulting in slower code. A little extra care in the critical regions of your program, keeping the characteristics of current processors in mind, can pay large dividends in performance.

Conclusion 

With the PowerPC's price-to-performance ratio closing in on that of the Pentium, users are finding this architecture more attractive. In addition to the Macintosh, there will soon be opportunities to field applications on Solaris and OS/2 for the PowerPC. The Mac already has fairly successful DOS 80x86 emulation, enabling you to run a lot of shrink-wrapped software without sacrificing your software investment. New applications can benefit from the scalability of the PowerPC family and the new speed it offers. However, you'll need to rethink strategies in order to wring out that last ounce of performance.

DDJ 
Listing One

#include "espresso.h"

void massive_count(pcube *T) {
    int *count = cdata.part_zeros;
    pcube *T1;

    /* Clear the column counts (count of # zeros in each column) */
    {
        register int i;
        for(i = cube.size - 1; i >= 0; i--)
            count[i] = 0;
    }

    /* Count the number of zeros in each column */
    {
        register int i, *cnt;
        register unsigned int val;
        register pcube p, cof = T[0], full = cube.fullset;
        for(T1 = T+2; (p = *T1++) != NULL; )
            for(i = LOOP(p); i > 0; i--)
                if (val = full[i] & ~ (p[i] | cof[i])) {
                    cnt = count + ((i-1) << LOGBPI);
#if BPI == 32
                    if (val & 0xFF000000) {
                        if (val & 0x80000000) cnt[31]++;
                        if (val & 0x40000000) cnt[30]++;
                        if (val & 0x20000000) cnt[29]++;
                        if (val & 0x10000000) cnt[28]++;
                        if (val & 0x08000000) cnt[27]++;
                        if (val & 0x04000000) cnt[26]++;
                        if (val & 0x02000000) cnt[25]++;
                        if (val & 0x01000000) cnt[24]++;
                    }
                    if (val & 0x00FF0000) {
                        if (val & 0x00800000) cnt[23]++;
                        if (val & 0x00400000) cnt[22]++;
                        if (val & 0x00200000) cnt[21]++;
                        if (val & 0x00100000) cnt[20]++;
                        if (val & 0x00080000) cnt[19]++;
                        if (val & 0x00040000) cnt[18]++;
                        if (val & 0x00020000) cnt[17]++;
                        if (val & 0x00010000) cnt[16]++;
                    }
#endif
                    if (val & 0xFF00) {
                        if (val & 0x8000) cnt[15]++;
                        if (val & 0x4000) cnt[14]++;
                        if (val & 0x2000) cnt[13]++;
                        if (val & 0x1000) cnt[12]++;
                        if (val & 0x0800) cnt[11]++;
                        if (val & 0x0400) cnt[10]++;
                        if (val & 0x0200) cnt[ 9]++;
                        if (val & 0x0100) cnt[ 8]++;
                    }
                    if (val & 0x00FF) {
                        if (val & 0x0080) cnt[ 7]++;
                        if (val & 0x0040) cnt[ 6]++;
                        if (val & 0x0020) cnt[ 5]++;
                        if (val & 0x0010) cnt[ 4]++;
                        if (val & 0x0008) cnt[ 3]++;
                        if (val & 0x0004) cnt[ 2]++;
                        if (val & 0x0002) cnt[ 1]++;
                        if (val & 0x0001) cnt[ 0]++;
                    }
                }
    }

    /*
     * Perform counts for each variable:
     *    cdata.var_zeros[var] = number of zeros in the variable
     *    cdata.parts_active[var] = number of active parts for each variable
     *    cdata.vars_active = number of variables which are active
     *    cdata.vars_unate = number of variables which are active and unate
     *
     *    best -- the variable which is best for splitting based on:
     *    mostactive -- most # active parts in any variable
     *    mostzero -- most # zeros in any variable
     *    mostbalanced -- minimum over the maximum # zeros / part / variable
     */
    {
        register int var, i, lastbit, active, maxactive;
        int best = -1, mostactive = 0, mostzero = 0, mostbalanced = 32000;
        cdata.vars_unate = cdata.vars_active = 0;

        for(var = 0; var < cube.num_vars; var++) {
            if (var < cube.num_binary_vars) { /* special hack for binary vars */
                i = count[var*2];
                lastbit = count[var*2 + 1];
                active = (i > 0) + (lastbit > 0);
                cdata.var_zeros[var] = i + lastbit;
                maxactive = MAX(i, lastbit);
            }
            else {
                maxactive = active = cdata.var_zeros[var] = 0;
                lastbit = cube.last_part[var];
                for(i = cube.first_part[var]; i <= lastbit; i++) {
                    cdata.var_zeros[var] += count[i];
                    active += (count[i] > 0);
                    if (active > maxactive) maxactive = active;
                }
            }

            /* first priority is to maximize the number of active parts */
            /* for binary case, this will usually select the output first */
            if (active > mostactive)
                best = var, mostactive = active, mostzero = cdata.var_zeros[best],
                mostbalanced = maxactive;
            else if (active == mostactive)
                /* secondary condition is to maximize the number zeros */
                /* for binary variables, this is the same as minimum # of 2's */
                if (cdata.var_zeros[var] > mostzero)
                    best = var, mostzero = cdata.var_zeros[best],
                    mostbalanced = maxactive;
                else if (cdata.var_zeros[var] == mostzero)
                    /* third condition is to pick a balanced variable */
                    /* for binary vars, this means roughly equal # 0's and 1's */
                    if (maxactive < mostbalanced)
                        best = var, mostbalanced = maxactive;

            cdata.parts_active[var] = active;
            cdata.is_unate[var] = (active == 1);
            cdata.vars_active += (active > 0);
            cdata.vars_unate += (active == 1);
        }
        cdata.best = best;
    }
}


Listing Two

! if (val = full[i] & ~ (p[i] | cof[i])) {
        sli     %r8,%r11,2          !+a4  Shift i, which is %r11, left 2
        lwzx    %r10,%r8,%r12       !+a8  Add %r8 (p) and %r12 to form address, load
        lwzx    %r7,%r8,%r4         !+ac  Add %r4 (cof) and %r8 (i<<2), load
        nor     %r10,%r7,%r10       !+b0  or not operation
        lwzx    %r8,%r8,%r5         !+b4  Add %r5 (full) & %r8 (i), load
        and.    %r8,%r8,%r10        !+b8  and of full[i] & ~(p[i] | cof[i])
                                    !     val ends up in %r8
        beq     ..LL63              !+bc
! cnt = count + ((i-1) << LOGBPI);
        sli     %r10,%r11,7         !+c0
        addi    %r6,%r10,-128       !+c4
        add     %r7,%r29,%r6        !+c8  Get base address assigned to cnt in %r7
! #if BPI == 32
! if (val & 0xFF000000) {
        andis.  %r10,%r8,65280      !+cc  Test bits in val
        beq     ..LL64              !+d0  Branch if bits not set
! if (val & 0x80000000) cnt[31]++;
        andis.  %r10,%r8,32768      !+d4  Test bits in val
        beq     ..LL65              !+d8  Branch if bits not set
        lwz     %r10,+124(%r7)      !+dc  Base address already in %r7, load element
        andis.  %r31,%r8,16384      !+e0  Scheduling places test of val ahead
                                    !     to avoid integer pipeline stall
        addi    %r10,%r10,1         !+e4  add 1 to cnt[31]
        stw     %r10,+124(%r7)      !+e8  Store element to memory
! if (val & 0x40000000) cnt[30]++;
        beq     ..LL66              !+ec  Branch on pretested condition
        b       ..LL67              !+f0
        andis.  %r10,%r8,16384      !+f4  Test next bits in val
        beq     ..LL66              !+f8  Branch if not set
        lwz     %r10,+120(%r7)      !+fc  Load cnt[30]
        andis.  %r31,%r8,8192       !+100 Test next bits
        addi    %r10,%r10,1         !+104 Increment cnt[30]
        stw     %r10,+120(%r7)      !+108 Store back to memory


Listing Three

/136 :  for (i = LOOP(p); i > 0; i--)
        movl    (%edx),%eax
        andl    $1023,%eax              / 0x3ff
        testl   %eax,%eax               / initialize i, get it into %eax, test for zero
        jle     .L43
.L44:
/137 :      if (val = full[i] & ~ (p[i] | cof[i])) {
        movl    (%edx,%eax,4),%ebx      / Load cof[i] into %ebx
        movl    68(%esp),%esi           / Load index into %esi
        orl     (%esi,%eax,4),%ebx      / or together, from memory
        movl    64(%esp),%esi           / load full[i] index
        notl    %ebx                    / not of expr
        andl    (%esi,%eax,4),%ebx      / and from memory
        testl   %ebx,%ebx               / val now in %ebx
        je      .L45
/138 :      cnt = count + ((i-1) << LOGBPI);
        movl    %eax,%edi               / Copy i
        shll    $7,%edi                 / Shift left
        testl   $-16777216,%ebx         / 0xff000000 / Test bits in val
        lea     -128(%edi,%ecx),%esi    / Get base address into %esi
/139 :  #if BPI == 32
/140 :      if (val & 0xFF000000) {
        je      .L46                    / Branch on bits not set
/141 :          if (val & 0x80000000) cnt[31]++;
        testl   $-2147483648,%ebx       / Test bits in val
        je      .L47                    / branch on bits not set
        incl    -4(%edi,%ecx)           / Increment cnt[31]
.L47:
/142 :          if (val & 0x40000000) cnt[30]++;
        testl   $1073741824,%ebx        / 0x40000000
        je      .L48
        incl    120(%esi)
.L48:
/143 :          if (val & 0x20000000) cnt[29]++;
        testl   $536870912,%ebx         / 0x20000000
        je      .L49
        incl    116(%esi)



Listing Four

!459        do 30 i = 1,n
        addi    %r11,%r7,4          !+68  Pointer to dy(i)
        addi    %r10,%r1,4          !+6c  Pointer to dx(i)
        mtctr   %r9                 !+70
        addi    %r10,%r10,-8        !+74  Array bias
        addi    %r11,%r11,-8        !+78
!460        dy(i) = dy(i) + da*dx(i)
        lfs                         !+7c  Load dy(i)
        lfsu    %f0,+4(%r10)        !+80  Load dx(i)
        fmadds                      !+84  Floating multiply and add
        stfsu   %f0,+4(%r11)        !+88  Store result into dy(i)
!461     30 continue
        addi    %r12,%r12,1         !+8c  Increment loop counter
        bdn     ..LL97              !+90  test and branch non-zero
        b       ..LL98              !+94


Listing Five

/459 :      do 30 i = 1,n
        cmpl    $2,%ecx                 / Check for loop execution
        movl    68(%esp),%esi           / Get pointer to dx
        movl    60(%esp),%edx           /
        jle     .L92
        jmp     .L93
.L91:
        movl    %eax,%edi
.L93:
/460 :      dy(i) = dy(i) + da*dx(i)
        flds    -4(%edx,%edi,4)         / load dx(i)
        fmul    %st(1),%st              / da * dx(i)
        addl    $-2,%ecx                / Loop was reversed to
        lea     2(%edi),%eax
        fadds   -4(%esi,%edi,4)         / Add dy(i)
        cmpl    $2,%ecx
        fstps   -4(%esi,%edi,4)         / Store result to dy(i)
/461 :   30 continue
        flds    (%edx,%edi,4)           / Next loop iteration, loop unrolling
        fmul    %st(1),%st
        fadds   (%esi,%edi,4)
        fstps   (%esi,%edi,4)
        ja      .L91
        jmp     .L92


Listing Six

/459 :      do 30 i = 1,n
        movl    $1,.L98.I
        decl    %ecx
        lea     1(%ecx),%eax
        andl    %eax,%eax
        jle     .L55
        jmp     .L57
.L57:
/460 :      dy(i) = dy(i) + da*dx(i)
        movl    .L98.I,%ecx             / I is now static, loaded from memory
        movl    24(%ebp),%edi
        flds    -4(%edi,%ecx,4)
        movl    16(%ebp),%edx           / Load index
        movl    12(%ebp),%esi           / Load address of array element
        flds    -4(%edx,%ecx,4)         / Load one array element to floating stack
        fmuls   (%esi)                  / Multiply
        faddp   %st,%st(1)              / Add
        fstps   -4(%edi,%ecx,4)         / Store result
/461 :   30 continue
        incl    %ecx                    / Do loop bookkeeping
        movl    %ecx,.L98.I
        decl    %ecx
        andl    %eax,%eax

H 


End Listings 


601 & 21064


PowerPC 601 and Alpha 21064

Second-generation RISC processors

Shlomo Weiss and James E. Smith

Shlomo is a faculty member in the Department of Electrical
Engineering/Systems at Tel Aviv University. Jim is a faculty member in
the Department of Electrical and Computer Engineering at the University
of Wisconsin-Madison. They are the authors of POWER and PowerPC:
Principles, Architecture, Implementation (Morgan Kaufmann Publishers,
1994). Shlomo and Jim can be reached through the DDJ offices.

Just as there is more than one way to skin a cat, there is more than
one way to implement RISC concepts. The PowerPC is a good example of a
high-performance RISC implementation that is tuned to a specific
architecture. It isn't, however, the only RISC implementation style
that processor designers have used. We'll compare the PowerPC 601 to an
alternative RISC architecture and implementation: the DEC Alpha 21064.

The 601 focuses on relatively powerful instructions and great
flexibility in instruction processing. The 21064 depends on a very fast
clock, with simpler instructions and a rigid implementation structure.
Both the 601 and the 21064 have load/store architectures with 32-bit,
fixed-length instructions. Each has 32 integer and 32 floating-point
registers, but beyond these basic properties, they have little in
common; see Table 1.

The 601 has a relatively small die size due to IBM's aggressive
0.6-micron CMOS technology with four levels of metal (a fifth metal
layer is used for local interconnect); see Table 2. The cache size of
each chip largely accounts for the substantial difference in the
transistor count. Two striking differences appear in clock cycle and
power dissipation. The 21064 is much faster, but also runs much hotter.
It's well known that in CMOS a faster clock draws more power, so even
if a fast clock “wins” in performance, its higher power consumption
could “lose” in usefulness, in portable PCs, for example.

PowerPC 601 Pipelines

All instructions for the 601 are processed in the fetch and dispatch
stages. Branch and Condition Register instructions go no farther.
Fixed-point and load/store instructions are also decoded in the
dispatch stage of the pipe and are then passed to the FXU to be
processed. Most fixed-point arithmetic and logical instructions take
just two clock cycles in the FXU: one to execute and one to be written
into the register file. All load/store instructions have three cycles
in the FXU: address generation, cache access, and register write. This
assumes a cache hit, of course.

The 601 design emphasizes getting the FXU instructions processed in as
few pipeline stages as possible. This low-latency design is evident in
the combining of the dispatch and decode phases of instruction
processing. The effect of an instruction pipeline's length on
performance is most evident after a branch, when the pipeline may be
empty or partially empty.

The shorter the pipeline, the more quickly instruction execution can
start again. Most of the time, the first instructions following a
branch are FXU instructions (even in floating-point-intensive code),
because a program sequence following a branch typically begins by
loading

                            PowerPC 601           Alpha 21064

Basic architecture          load/store            load/store
Instruction length          32-bit                32-bit
Byte/halfword load/store    yes                   no
Condition codes             yes                   no
Conditional moves           no                    yes
Integer registers           32                    32
Integer-register size       32/64 bit             64 bit
Floating-point registers    32                    32
Floating-register size      64 bit                64 bit
Floating-point format       IEEE 32-bit, 64-bit   IEEE, VAX 32-bit, 64-bit
Virtual address             52-80 bit             43-64 bit
32/64-mode bit              yes                   no
Segmentation                yes                   no
Page size                   4 KB                  implementation specific

Table 1: Architectural characteristics.



                                     PowerPC 601          Alpha 21064

Technology                           0.6-micron CMOS      0.75-micron CMOS
Levels of metal                      4                    3
Die size                             1.09 cm square       2.33 cm square
Transistor count                     2.8 million          1.68 million
Total cache (instructions + data)    32 KB                16 KB
Package                              304-pin QFP          431-pin PGA
Clock frequency                      50 MHz (initially)   150 to 200 MHz
Power dissipation                    9 watts @ 50 MHz     30 watts @ 200 MHz

Table 2: Implementation characteristics.


(a) PowerPC 601

                    Integer                 Floating
                    Load  Store  Operate    Load  Store  Operate    Branch

Integer load                                             X          X
Integer store                                            X          X
Integer operate                                          X          X
Floating load                                            X          X
Floating store                                           X          X
Floating operate    X     X      X          X     X                 X
Branch              X     X      X          X     X      X

(b) Alpha 21064

                    Integer                 Floating                Branch
                    Load  Store  Operate    Load  Store  Operate    Integer  Floating

Integer load                     X                       X
Integer store                    X
Integer operate     X     X                 X            X          X
Floating load                    X                       X
Floating store                                           X
Floating operate    X            X          X     X                          X
Integer branch                   X
Floating branch                                          X

Table 3: Instruction dispatch rules; (a) in the 601, three mutually
compatible instructions (marked with X) may issue simultaneously; (b)
in the 21064, two compatible instructions may issue simultaneously.
Integer branches depend on an integer register, and floating branches
depend on a floating register.



               Integer Registers            Floating Registers
               Read Ports   Write Ports     Read Ports   Write Ports

PowerPC 601    3            2               3            2
Alpha 21064    4            2               3            2

Table 4: Register file ports.


data from memory (or by preparing addresses with fixed-point
instructions). Obviously, a short FXU pipeline is desirable.

In contrast, floating-point instructions are processed more slowly. FPU
decoding is not performed in the same clock cycle as dispatching. The
first floating-point instruction following a branch is likely to depend
on a preceding load, so the extra delay in the floating-point pipeline
will not affect overall performance significantly. This extra delay
reduces the interlock between a floating load and a subsequent
dependent floating-point instruction to just one clock cycle.

The buffer at the beginning of the FPU can hold up to two instructions;
the second buffer slot is the decode latch, where instructions are
decoded. In the FXU pipeline, there is a one-instruction decode buffer
that can be bypassed. The decode buffers provide a place for
instructions to be held if one of the pipelines blocks due to some
interlock condition or an instruction that consumes the execute stages
for multiple cycles. By getting instructions into the decode buffers
when a pipeline is blocked, the instruction buffers are allowed to
continue dispatching instructions (especially branches) to nonblocked
units.

21064 Pipelines

The 21064 pipeline complex is composed of three parallel pipelines:
fixed-point, floating-point, and load/store. The pipelines are
relatively deep, and the integer and load/store pipes are the same
length. These are the stages that an instruction may go through:

1. F, Fetch. The instruction cache is accessed, and two instructions
   are fetched.

2. S, Swap. The two instructions are directed to either the integer or
   the floating-point pipeline, sometimes swapping their positions, and
   branch instructions are predicted.

3. D, Decode. Instructions are decoded in preparation for issue: the
   opcode is inspected to determine the register and resource
   requirements of each instruction. Unlike IBM processors, registers
   are not read during the decode stage.

4. I, Issue. Instructions are issued and operands are read from the
   registers. The register and resource dependencies determine if the
   instruction should begin execution or be held back. After the issue
   stage, instructions are no longer blocked in the pipelines, and can
   therefore be completed.

5. A, ALU stage 1. Integer adds, logicals, and short-length shifts are
   executed. Their results can be immediately bypassed back, so these
   appear to be single-cycle instructions. Longer-length shifts are
   initiated in this stage, and loads and stores do their
   effective-address add.

6. B, ALU stage 2. Longer-length shifts complete and their results are
   bypassed back to ALU 1, so these are two-cycle instructions. For
   loads and stores, the data cache tags are read. Loads also read
   cache data.

7. W, Write stage. Results are written into the register file. Cache
   hit/miss is determined. Data from store instructions that hit is
   stored in a buffer. It will then be written into the cache during a
   cycle with no loads.

The 21064 integer pipeline relies on a large number of bypasses to
achieve high performance. In a deep pipeline, bypasses reduce apparent
latencies. There are a total of 38 separate bypass paths.

Floating-point instructions pass through F, S, D, and I stages just
like integer instructions. Floating-point multiply and add instructions
are performed in stages F through K. The floating-point divide takes 31
cycles for single precision and 61 cycles for double precision.

Dispatch Rules

The dispatch rules in the 601 are quite simple. The architecture has
three units (Integer or Fixed Point, Floating Point, and Branch) that
can process instructions simultaneously. Integer operate instructions
and all loads and stores go to the same pipeline (FXU), and only one
instruction of this category may issue per clock cycle.

The 21064's swap corresponds to the 601's dispatch. Instructions issue
two stages later. In the 21064, instructions must issue in their
original program order, and dispatch (that is, the swap stage) helps to
enforce this order. A pair of instructions belonging to the same
aligned doubleword (or “quadword” in DEC parlance) can issue
simultaneously. Consecutive instructions in different doublewords may
not dual-issue, and if two instructions in the same doubleword cannot
issue simultaneously, the first in the program sequence must issue
first.
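
The alignment half of this rule is easy to state in code. The following
is only a sketch: the function name and the use of byte addresses are
our own illustration, not anything defined by DEC, and it checks
nothing but the "same aligned doubleword" condition (the
instruction-class rules come on top of it).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: instructions are 4 bytes long, so an aligned
 * 8-byte doubleword holds exactly two of them. */
bool same_aligned_doubleword(uint64_t pc_first, uint64_t pc_second)
{
    /* The second instruction must directly follow the first... */
    if (pc_second != pc_first + 4)
        return false;
    /* ...and the first must sit at the start of an 8-byte block. */
    return (pc_first & 7u) == 0;
}
```

With this definition, the pair at addresses 0x1000 and 0x1004 is a
dual-issue candidate, while the consecutive pair at 0x1004 and 0x1008
straddles two doublewords and is not.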

The 21064 implements separate integer and load/store pipelines, and
several combinations of these instructions may be dual-issued (with the
exception of integer operate/floating store, and floating
operate/integer store). The separate load/store unit requires an extra
set of ports to both the integer and floating register files. The
load/store ports are shared with the Branch Unit, which has access to
all the registers because the 21064 architecture has no condition
codes, and branches may depend on any integer or floating register.
Consequently, branches may not be issued simultaneously with load or
store instructions.




Table 3 summarizes the dispatch rules for both chips. In the 601 table,
an X in the corresponding row/column indicates that two instructions
may simultaneously issue. For three instructions, all three pairs must
have Xs. In the 21064 table, two instructions with an X may
simultaneously issue.
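
For the 601, the table follows directly from the three-unit rule stated
earlier, so it can be sketched in a few lines of C. The type and
function names below are ours, chosen for illustration; the only facts
taken from the text are that integer operates and all loads and stores
share the FXU, and that three instructions issue together only when
every pair is compatible.

```c
#include <stdbool.h>

/* Instruction classes used in the 601 half of Table 3. */
enum cls601 { INT_LOAD, INT_STORE, INT_OP, FP_LOAD, FP_STORE, FP_OP, BRANCH };

/* The three 601 units. */
enum unit601 { FXU, FPU, BU };

/* Integer operates and every load or store (integer or floating)
 * go to the FXU; only floating operates use the FPU. */
static enum unit601 unit_of(enum cls601 c)
{
    if (c == FP_OP)  return FPU;
    if (c == BRANCH) return BU;
    return FXU;
}

/* Two instructions are compatible (an X in the table) when they
 * need different units. */
bool pair_ok_601(enum cls601 a, enum cls601 b)
{
    return unit_of(a) != unit_of(b);
}

/* Three instructions may issue together only if all three pairs
 * are compatible. */
bool triple_ok_601(enum cls601 a, enum cls601 b, enum cls601 c)
{
    return pair_ok_601(a, b) && pair_ok_601(a, c) && pair_ok_601(b, c);
}
```

A load, a floating operate, and a branch pass the three-way test; an
integer operate paired with a floating store does not, because both
need the FXU.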

The ability of the 21064 to dual-issue a load with an integer-operate
instruction is a definite advantage over the 601. Many applications
(not to mention the operating system) use very little floating point;
the 21064 can execute these apps with high efficiency, but the 601 can
execute only one integer instruction per clock cycle (while its FPU
sits idle).

Register Files

The 21064 and 601 have register files with almost the same number of
ports; see Table 4. Both start with one write and two read ports to
service operate instructions. The 21064 provides an additional pair of
read/write ports for load/store unit data. Branches share the
load/store register ports, which brings the count up to 3R 2W for both
integer and floating-register files. One additional integer read port
is needed to get the address value for stores and loads. Doing an
integer store in parallel with an integer operate involves an extra
integer read port, but not allowing a register-plus-register addressing
mode saves a register-read port.

The 601's one write and two read ports for operate instructions are
fortified by an additional integer read port for single-cycle
processing of store with index instructions, which read three registers
(two for the effective address, one for the result). An extra integer
write port allows the result of an operate instruction and data
returned from the cache to be written in the same clock cycle. The same
consideration accounts for two write ports in the floating-register
file. The three floating-point read ports accommodate the combined
floating multiply/add instruction.


Data Caches

The 21064 uses separate instruction and data caches. The data cache is
a small (8 KB), direct-mapped cache designed for very fast access
times; see Figure 1(a). The address add consumes one clock cycle.
During the next clock cycle, the Translation Lookaside Buffer (TLB) is
accessed and the cache data and tag are read. In a direct-mapped cache
this is easy because only one tag must be read, and the data, if
present, can only be in one place. The TLB address translation
completes in the third cycle, and the tag is compared with the upper
address bits. A cache hit or miss is determined about halfway through
this clock cycle. The data are always delivered to the registers as an
aligned, 8-byte doubleword. Alignment, byte selecting, and the like
must be done with separate instructions.
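
To see why a direct-mapped lookup needs only one tag read, it helps to
split an address into its index and tag. The sketch below assumes a
32-byte line, which the text does not give, and it ignores address
translation entirely; the macro and function names are ours.

```c
#include <stdint.h>

#define CACHE_BYTES 8192u   /* 8 KB, from the text */
#define LINE_BYTES    32u   /* assumed line size, not given in the text */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)   /* 256 lines */

/* In a direct-mapped cache each address maps to exactly one line,
 * so the index alone selects the single candidate location. */
uint32_t dm_index(uint64_t addr)
{
    return (uint32_t)((addr / LINE_BYTES) % NUM_LINES);
}

/* The remaining upper address bits form the tag that is compared
 * against the one stored tag to decide hit or miss. */
uint64_t dm_tag(uint64_t addr)
{
    return addr / CACHE_BYTES;
}
```

Two addresses exactly 8 KB apart get the same index and different tags,
which is the conflict case a set-associative cache is designed to
soften.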

In the 601, the unified data/instruction cache is much larger (32 KB)
and is 8-way set associative, yielding a higher hit rate than the
21064. Figure 1(b) shows how much more “work” the 601 does in a clock
cycle. It does an address add and the cache directory/TLB lookup in the
same cycle. During the next cycle, it accesses the 32-byte-wide data
memory and selects and aligns the data field.



(a)

double x[512], y[512];

for (k = 0; k < 512; k++)
    x[k] = (r*x[k] + t*y[k]);

(b)

# r1 points to x
# r2 points to y
# r6 points to the end of y
# fp2 contains t
# fp4 contains r
# r5 contains the constant 1

LOOP:   ldt     fp3 = y(r2,0)       # load floating double
        ldt     fp1 = x(r1,0)       # load floating double
        mult    fp3 = fp3,fp2       # floating multiply double, t*y
        addq    r2 = r2,8           # bump y pointer
        mult    fp1 = fp1,fp4       # floating multiply double, r*x
        subq    r4 = r2,r6          # subtract y end from current pointer
        addt    fp1 = fp3,fp1       # floating add double, r*x+t*y
        stt     x(r1,0) = fp1       # store floating double to x(k)
        addq    r1 = r1,8           # bump x pointer
        bne     r4,LOOP             # branch on r4 ne 0

Example 1: Alpha 21064 pipelined processing example. (a) C code;
(b) assembly code.



                      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
ldt  fp3=y(r2,0)      F  S  D  I  A  B  W
ldt  fp1=x(r1,0)      F  .  S  D  I  A  B  W
mult fp3=fp3,fp2            F  S  D  .  I  F  G  H  J  K  W
addq r2=r2,8                F  S  D  .  I  A  B  W
mult fp1=fp1,fp4               F  S  .  D  I  F  G  H  J  K  W
subq r4=r2,r6                  F  S  .  D  I  A  B  W
addt fp1=fp3,fp1                  F  .  S  D  .  .  .  .  .  I  F  G  H  J  K  W
stt  x(r1,0)=fp1                  F  .  S  D  .  .  .  .  .  .  .  .  .  I  A  B  W
addq r1=r1,8                            F  S  .  .  .  .  .  .  .  .  .  D  I  A  B
bne  r4,LOOP                            F  S  .  .  .  .  .  .  .  .  .  D  I  A
ldt  fp3=y(r2,0)                              F  .  .  .  .  .  .  .  .  S  D  I  A  B
ldt  fp1=x(r1,0)                              F  .  .  .  .  .  .  .  .  .  S  D  I  A

Example 2: 21064 pipeline flow for loop example.


                                                              Issue time

LOOP:   ldt     fp3 = y(r2,0)       # load y[k]                    0
        ldt     fp1 = x(r1,0)       # load x[k]                    1
        ldt     fp7 = y(r2,8)       # load y[k+1]                  2
        ldt     fp5 = x(r1,8)       # load x[k+1]                  3
        mult    fp3 = fp3,fp2       # t*y[k]                       4
        ldt     fp11 = y(r2,16)     # load y[k+2]                  4
        mult    fp1 = fp1,fp4       # r*x[k]                       5
        ldt     fp9 = x(r1,16)      # load x[k+2]                  5
        mult    fp7 = fp7,fp2       # t*y[k+1]                     6
        ldt     fp15 = y(r2,24)     # load y[k+3]                  6
        mult    fp5 = fp5,fp4       # r*x[k+1]                     7
        ldt     fp13 = x(r1,24)     # load x[k+3]                  7
        mult    fp11 = fp11,fp2     # t*y[k+2]                     8
        addq    r2 = r2,32          # bump y pointer               8
        mult    fp9 = fp9,fp4       # r*x[k+2]                     9
        subq    r4 = r2,r6          # remaining y size             9
        mult    fp15 = fp15,fp2     # t*y[k+3]                    10
        mult    fp13 = fp13,fp4     # r*x[k+3]                    11
        addt    fp1 = fp3,fp1       # r*x[k]+t*y[k]               12
        addt    fp5 = fp7,fp5       # r*x[k+1]+t*y[k+1]           13
        addt    fp9 = fp11,fp9      # r*x[k+2]+t*y[k+2]           15
        stt     x(r1,0) = fp1       # store x[k]                  16
        addt    fp13 = fp15,fp13    # r*x[k+3]+t*y[k+3]           17
        stt     x(r1,8) = fp5       # store x[k+1]                17
        stt     x(r1,16) = fp9      # store x[k+2]                19
        stt     x(r1,24) = fp13     # store x[k+3]                21
        addq    r1 = r1,32          # bump x pointer              22
        bne     r4,LOOP             # next loop                   22
LOOP:   ldt     fp3 = y(r2,0)       # next iteration              23

Example 3: Example loop, unrolled for the Alpha 21064.


The 601 gets more done in fewer stages, but the 21064's clock cycle is
about a third to a fourth the length of the 601's. Consequently, the
601's two clock cycles take much longer than the 21064's three cycles.
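
A quick calculation with the clock rates from Table 2 makes the gap
concrete. The helper names below are ours; the cycle counts (two for
the 601, three for the 21064) and the clock rates (50 MHz, and 150 to
200 MHz) are the ones quoted in the article.

```c
/* Length of one clock cycle in nanoseconds for a clock given in MHz. */
double cycle_ns(double mhz)
{
    return 1000.0 / mhz;
}

/* Time taken by a given number of cycles at a given clock rate. */
double latency_ns(int cycles, double mhz)
{
    return cycles * cycle_ns(mhz);
}
```

Under these figures, latency_ns(2, 50.0) gives 40 ns for the 601's
cache access, while latency_ns(3, 200.0) gives 15 ns for the 21064
(about 20 ns at 150 MHz).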

Example of Pipeline Flow

Example 1 shows a for loop in C and its corresponding 21064
assembly-language code. Note in this and subsequent examples that the
notation, bit numbering, and assembly language do not conform to that
of Alpha; they have been modified to be consistent with PowerPC
notation. Example 2 is the 21064 pipeline flow for the example loop. It
shows in-order issue, dual-issue for aligned instruction pairs, and the
relatively long six-clock-period floating-point latency. After the I
stage, instructions never block.

The importance of the swap stage is clear from the first two
instructions, which cannot dual-issue because both are loads. The
second instruction is held for one cycle while the first moves ahead.
The first dual-issue occurs for the first addq-mult pair. Because mult
is the first instruction in the doubleword, addq must wait, even though
no dependencies hold it back. The sequence of dependent floating-point
instructions paces instruction issue for most of the loop. Note that
the floating store issues in anticipation of the floating-point result.
It waits only four, not six, clock periods for the result so that it
reaches its write stage just in time to have the floating-point result
bypassed to it.

A bubble follows the predicted branch at the end of the loop. Because
other instructions in the pipeline are blocked, however, by the time
the ldt following the branch is ready to issue, the bubble is
“squashed.” That is, if the instruction ahead of the bubble blocks and
the instruction behind proceeds, the bubble is squashed between the two
and eliminated.

Overall, the loop takes 16 clock periods per iteration in steady state.
(The first ldt passes through I at time 4; during the second iteration,
it issues at time 20.) In comparison, the 601 takes six (longer) clock
periods.

Floating-point latencies are a major performance problem for the 21064
when it executes this type of code. Also, in-order issue prevents the
loops from “telescoping” together as they would in the 601: there is
very little overlap among consecutive loop iterations, and the small
amount that occurs is mostly due to branch prediction. Each
parallelogram in Figure 2 illustrates the general shape of the pipeline
flow for a single loop iteration.

In the 601, the branch processor eliminates the need for branch
prediction, and the out-of-order dispatch, along with multiple buffers
placed at key points, telescopes the loop iterations. Telescoping in
the 601 is limited by the lack of a store buffer in the FPU, which
other implementations may choose to provide. The RS/6000, for example,
has register renaming, deeper buffers, and more bypass paths; it
achieves much better telescoping than the 601.

Software pipelining or loop unrolling is likely to provide much better
performance for a deeply pipelined implementation like the 21064. The
DEC compilers unroll loops. Example 3 shows the unrolled version of
Example 2. The example loop is unrolled four times. The clock period at
which instructions pass through the I stage is shown in the right-hand
column. Now, in steady state, four iterations take 23 clock periods
(about six per iteration), nearly three times better than the rolled
version. Unrolling also emphasizes the performance advantage of
dual-issue.
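
At the source level, the transformation looks like the sketch below.
This is our own C rendering of the idea, not output from the DEC
compilers, and the function names are made up for illustration. The
unrolled version assumes the trip count is a multiple of four, which
holds for the 512-element arrays of Example 1.

```c
#include <stddef.h>

/* Rolled form, equivalent to the loop in Example 1. */
void scale_add(double *x, const double *y, double r, double t, size_t n)
{
    for (size_t k = 0; k < n; k++)
        x[k] = r * x[k] + t * y[k];
}

/* Unrolled by four; n is assumed to be a multiple of 4. */
void scale_add_unrolled4(double *x, const double *y, double r, double t,
                         size_t n)
{
    for (size_t k = 0; k < n; k += 4) {
        /* Four independent multiply-add chains per trip, so their
         * floating-point latencies can overlap. */
        double a0 = r * x[k]     + t * y[k];
        double a1 = r * x[k + 1] + t * y[k + 1];
        double a2 = r * x[k + 2] + t * y[k + 2];
        double a3 = r * x[k + 3] + t * y[k + 3];
        x[k]     = a0;
        x[k + 1] = a1;
        x[k + 2] = a2;
        x[k + 3] = a3;
    }
}
```

Both functions compute the same result; the second simply gives the
scheduler four independent chains of work per trip and one
loop-closing branch instead of four.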

Loop unrolling also improves the performance of the 601, as Example 4
shows. After dispatching in the 601, instructions may be held in a
buffer or in the decode stage if the pipeline is blocked. Hence, we
show FXU and FPU decode time, and BU execute time (which is the same
cycle in which a branch is decoded).

Assume that the loop body is aligned in the cache sector. Eight
instructions are fetched, and instruction fetching can keep the
instruction buffer full until time 2; after that, the cache is busy
with load instructions. The instruction queue becomes empty and the
pipeline is starved for instructions, but these cannot be fetched until
time 9, when the cache finally becomes available. At this time, the six
remaining instructions of the cache sector are fetched (the first two
were fetched at time 2).

The unrolled loop (four iterations) takes 20 clock cycles (five clock
cycles per loop iteration versus six in the rolled version).
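
Cycle counts alone hide the difference in clock rate, so it is worth
converting them to time per original loop iteration. The function below
is our own illustration. The cycle counts are those quoted in the text;
pairing them with 50 MHz for the 601 and 200 MHz for the 21064 (the
Table 2 figures) is our assumption.

```c
/* Nanoseconds per original loop iteration, given the clock cycles
 * needed for `iters` iterations on a machine clocked at `mhz`. */
double ns_per_iteration(int cycles, int iters, double mhz)
{
    return (double)cycles / iters * (1000.0 / mhz);
}
```

With those assumptions the rolled loop costs 80 ns per iteration on the
21064 (16 cycles) and 120 ns on the 601 (6 cycles). Unrolled, the
figures are 28.75 ns (23 cycles for four iterations) and 100 ns (20
cycles for four iterations).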

Branch Instructions

There are significant differences in the way the PowerPC and Alpha
architectures handle branches; see Figure 3. The PowerPC has a special
set of registers designed to implement branches. Conditional branches
may test fields in the Condition Code Register and the contents of a
special register, the Count Register. A single branch instruction may
implement a loop-closing branch whose outcome depends on both the Count
Register and a Condition Code value. Comparison instructions set fields
of the Condition Code Register explicitly, and most arithmetic and
logical instructions may optionally set a condition field by using the
record bit.

In the Alpha, conditional branches test a general-purpose register
relative to zero or to odd or even. Thus, a test can be performed on
the result of any instruction. Comparison instructions leave their
result in a general-purpose register.

Certain control-transfer instructions save the updated program counter
and use it as a subroutine return address. In the Alpha, these are
special jump instructions that save the return address in a
general-purpose register. In the PowerPC, this is done in any branch by
setting the Link (LK) bit to 1, and saving the return address in the
Link Register.

The Alpha also implements a set of conditional move instructions that
move a value from one register to another, but only if a condition,
similar to the branch condition, is satisfied. These conditional moves
eliminate branches in many simple, conditional code sequences; see
Example 5. A simple If-Then-Else sequence is given in Example 5(a). A
conventional code sequence appears in Example 5(b); the timing shown is
for the best-case path, assuming a correct prediction. Example 5(c)
uses a conditional move. While the load is being done, both shifts can
essentially be performed for free. The shift 4 is tentatively placed in
register r3 to be stored to memory. If the test of a is True, then the
conditional move to c replaces the value in r3 with the shift 2
results. The total time is shorter than the branch implementation (even
in the best case) and does not depend on branch prediction.
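
The same idea can be written in C. The sketch below is ours, using the
shift amounts (2 and 4) named in the paragraph above; whether a given
compiler turns the second function into an actual conditional-move
instruction is not something the source code can guarantee.

```c
#include <stdint.h>

/* Branching form of the If-Then-Else. */
uint32_t shift_branch(uint32_t a, uint32_t b)
{
    if (a == 1)
        return b << 2;
    return b << 4;
}

/* Branch-free form: compute both shifts, then select one. */
uint32_t shift_select(uint32_t a, uint32_t b)
{
    uint32_t c    = b << 4;                     /* tentative result (r3) */
    uint32_t alt  = b << 2;                     /* alternative (r4) */
    uint32_t mask = 0u - (uint32_t)(a == 1);    /* all ones if a == 1 */
    return (alt & mask) | (c & ~mask);
}
```

Both shifts are computed unconditionally, just as in the
conditional-move sequence, and the comparison result only chooses
between them.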

In general, branch target addresses are determined in the following
ways:

* Adding a displacement to the program counter (PC relative). Available
  in both architectures.

* Absolute. Available only in the PowerPC, where the displacement is
  interpreted as an absolute address if the Absolute Address (AA) bit
  is set to 1.

* Register indirect. Available for instructions not shown in Figure 3.
  These are the XL-form conditional branches in the PowerPC and jump
  instructions in


Figure 1: Cache access paths. (a) Alpha 21064; (b) PowerPC 601.

                                                              Instr.  FXU     FPU     BU
                                                              fetch   decode  decode  exec.
                                                              time    time    time    time

        # CTR = 128 (loop count/4)
LOOP:   lfs     fp0 = y(r3,2052)    # load y[k]                 0       1
        lfs     fp4 = y(r3,2056)    # load y[k+1]               0       2
        lfs     fp6 = y(r3,2060)    # load y[k+2]               0       3
        fmuls   fp0 = fp0,fp1       # t*y[k]                    0               4
        lfs     fp8 = y(r3,2064)    # load y[k+3]               0       4
        fmuls   fp4 = fp4,fp1       # t*y[k+1]                  0               5
        lfs     fp2 = x(r3,4)       # load x[k]                 0       5
        fmuls   fp6 = fp6,fp1       # t*y[k+2]                  0               6
        lfs     fp5 = x(r3,8)       # load x[k+1]               2       6
        fmuls   fp8 = fp8,fp1       # t*y[k+3]                  2               7
        lfs     fp7 = x(r3,12)      # load x[k+2]               9       10
        fmadds  fp0 = fp0,fp2,fp3   # r*x[k] + t*y[k]           9               11
        lfs     fp9 = x(r3,16)      # load x[k+3]               9       11
        fmadds  fp4 = fp4,fp5,fp3   # r*x[k+1] + t*y[k+1]       9               12
        fmadds  fp6 = fp6,fp7,fp3   # r*x[k+2] + t*y[k+2]       9               13
        fmadds  fp8 = fp8,fp9,fp3   # r*x[k+3] + t*y[k+3]       9               14
        stfs    x(r3+4) = fp0       # store x[k]                10      14      15
        stfs    x(r3+8) = fp4       # store x[k+1]              10      15      16
        stfs    x(r3+12) = fp6      # store x[k+2]              10      16      17
        stfsu   x(r3=r3+16) = fp8   # store x[k+3]              10      17      18
        bc      LOOP,CTR≠0          # dec CTR, branch if CTR≠0  11                      15
LOOP:   lfs     fp0 = y(r3,2052)    # load y[k]                 20      21

Example 4: Example loop, unrolled for the PowerPC 601. FXU instructions
are dispatched and decoded in the same clock cycle.


(a)     if (a == 1) c = b << 2;
        else c = b << 4;

(b)                                                        Issue time
        # initially, assume
        # r1 contains b,
        # r7 points to a,
        # r8 points to c.

        ldl    r2 = a(r7,0)     # load a from memory           0
        cmpeq  r5 = r2,1        # test a                       3
        beq    r5,SHFT2         # branch if a=1                4
                                # assume taken
        sll    r3 = r1,4        # shift b << 4
        br     STORE            # branch uncond                6
SHFT2:  sll    r4 = r1,2        # shift b << 2
STORE:  stl    r3 = c(r8,0)     # store c                      6

(c)                                                        Issue time
        # initially, assume
        # r1 contains b,
        # r7 points to a,
        # r8 points to c.

        ldl    r2 = a(r7,0)     # load a from memory           0
        sll    r3 = r1,4        # shift b << 4                 1
        sll    r4 = r1,2        # shift b << 2                 2
        cmpeq  r5 = r2,1        # test a                       3
        cmov   r3 = r4,r5       # conditional move to c        4
        stl    r3 = c(r8,0)     # store c                      4

Example 5: Alpha 21064 conditional-move example. (a) C code; (b) assembly
code with conditional branch; (c) assembly code with conditional move.
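
The idea behind Example 5(c) can also be expressed in C. The sketch below is illustrative only: the function names are hypothetical, and the shift amounts (2 and 4) follow the comments in the assembly listings. Both shifts are computed unconditionally and a mask selects the result, so no branch has to be predicted.

```c
#include <stdint.h>

/* Branching form: the outcome of the test steers control flow. */
static uint32_t shift_branchy(uint32_t a, uint32_t b)
{
    if (a == 1)
        return b << 2;
    return b << 4;
}

/* Branch-free form: compute both candidates, then select one.
   The mask is all ones when a == 1 and all zeros otherwise. */
static uint32_t shift_select(uint32_t a, uint32_t b)
{
    uint32_t if_one   = b << 2;
    uint32_t if_other = b << 4;
    uint32_t mask     = (uint32_t)0 - (uint32_t)(a == 1);
    return (if_one & mask) | (if_other & ~mask);
}
```

A compiler targeting a machine with a conditional-move instruction may well produce the second form from the first on its own; the point is only that the select replaces a control dependence with a data dependence.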


Figure 2: Comparison of loop overlap in (a) 21064- and (b) PowerPC 601-like
implementations.


the Alpha. General-purpose registers in the Alpha are used, and the Count
Register and Link Register are used in the PowerPC.

Both processors predict branches to reduce pipeline bubbles. The 601 uses a
static branch prediction made by the compiler. Also, as a hedge against a
wrong prediction, the 601 saves the contents of the instruction buffer
following a branch-taken prediction until instructions from the taken path
are delivered from memory; thus, the instructions on the not-taken path are
available immediately if a misprediction is detected.

The 21064 implements dynamic branch prediction with a 2048-entry table; one
entry is associated with each instruction in the instruction cache. The
prediction table updates as a program runs and contains the outcome of the
most recent execution of each branch. This predictor is based on the
observation that most branches are decided the same way as on their
previous execution. This is especially true for loop-closing branches.
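
As a rough illustration of this "same as last time" scheme, here is a minimal C sketch. It is a software model, not a description of the 21064 hardware: the indexing function and all names are assumptions; only the table size and the one-outcome-per-entry idea come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HISTORY_ENTRIES 2048u   /* table size given in the text */

/* One bit per entry: the outcome of the most recent execution. */
typedef struct {
    bool taken[HISTORY_ENTRIES];
} history_table;

/* Hypothetical indexing: word-aligned instruction address modulo table size. */
static unsigned history_index(uint64_t pc)
{
    return (unsigned)((pc >> 2) % HISTORY_ENTRIES);
}

/* Predict that the branch goes the same way it went last time. */
static bool predict_taken(const history_table *t, uint64_t pc)
{
    assert(t != NULL);
    return t->taken[history_index(pc)];
}

/* Record the actual outcome once the branch resolves. */
static void record_outcome(history_table *t, uint64_t pc, bool taken)
{
    assert(t != NULL);
    t->taken[history_index(pc)] = taken;
}

/* Count mispredictions for a loop-closing branch that is taken
   (iterations - 1) times and then falls through once. */
static unsigned loop_mispredictions(history_table *t, uint64_t pc,
                                    unsigned iterations)
{
    unsigned misses = 0;
    for (unsigned i = 0; i < iterations; i++) {
        bool actual = (i + 1 < iterations);
        if (predict_taken(t, pc) != actual)
            misses++;
        record_outcome(t, pc, actual);
    }
    return misses;
}
```

In this model a loop of any length costs at most two mispredictions per run (one on entry, one on exit), which is why loop-closing branches suit this kind of predictor.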

This type of prediction does not always work well for subroutine returns,
however. A subroutine may be called from a number of places, so the return
jump is not necessarily the same on two consecutive executions. The 21064
has special hardware to predict the target address for
return-from-subroutine jumps. The compiler places the lower 16 bits of the
return address in a special field of the jump-to-subroutine instruction.
When this instruction is executed, the return address is pushed on a
four-entry prediction stack, so return addresses can be held for
subroutines nested four deep. The stack is popped prior to returning from
the subroutine, and the return address is used to prefetch instructions
from the cache.
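
The prediction stack can be modeled in a few lines of C. This is a sketch under stated assumptions: the names are hypothetical, and the wrap-around behavior on deeper nesting is an assumption made for illustration, not something the text specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RETURN_STACK_DEPTH 4u   /* four entries, per the text */

typedef struct {
    uint16_t entry[RETURN_STACK_DEPTH]; /* low 16 bits of return addresses */
    unsigned top;                       /* pushes minus pops so far */
} return_stack;

/* Jump to subroutine: push the low 16 bits of the return address.
   Assumption: nesting deeper than four overwrites the oldest entry. */
static void rs_push(return_stack *s, uint64_t return_address)
{
    assert(s != NULL);
    s->entry[s->top % RETURN_STACK_DEPTH] = (uint16_t)(return_address & 0xFFFFu);
    s->top++;
}

/* Return from subroutine: pop the predicted low 16 bits of the target. */
static uint16_t rs_pop(return_stack *s)
{
    assert(s != NULL && s->top > 0);
    s->top--;
    return s->entry[s->top % RETURN_STACK_DEPTH];
}
```

With calls nested four deep or less, every pop yields the right address; on the fifth level the oldest entry has been lost, so the outermost return is mispredicted in this model.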

Conditional-Branch Pipeline Flow 

We are now ready to step through the pipeline flow for the Alpha
conditional branches; see Figure 4.

The swap stage of the pipeline examines instructions in pairs. After the
branch instruction is detected and predicted, it takes one clock cycle to
compute the target address and begin fetching, which may lead to a
one-cycle bubble in the pipeline. The pipeline is designed to allow
squashing of this bubble. In the case of a simultaneous dispatch conflict,
as in Figure 4(a), the instruction preceding the branch must be split from
it anyway, so the branch instruction waits a cycle and fills in the bubble
naturally. If the pipeline stalls ahead of the branch, the bubble can be
squashed by having an instruction behind the branch move up in the pipe.
If the bubble is squashed and the prediction is correct, the branch
effectively becomes a zero-cycle branch.

Figure 4(b) shows the incorrect-prediction case. The branch instruction
registers are read during issue stage. During the A stage, the register
can be tested and the correctness of the prediction determined quickly
enough to notify the instruction-fetch stage if there is a misprediction.
Then, the correct path can be fetched in the next cycle. As a result, four
stages of the pipeline must be flushed if the prediction is incorrect. For
the jump-to-subroutine instruction, the penalty for a misprediction is
five cycles.

For branches, the biggest architectural difference between the Alpha and
the PowerPC is that the Alpha uses general-purpose registers for testing
and subroutine linkage, while the PowerPC uses special-purpose registers
held in the Branch Unit. This allows the PowerPC to execute branch
instructions in the Branch Unit immediately after they are fetched. In
fact, the PowerPC looks back in the instruction buffer so that it can
execute, or at least predict, branches while they are being fetched. The
Alpha implementation, in contrast, must treat branch instructions like the
other instructions. They are decoded in the D pipeline stage, read
registers in I, and executed in the A stage.

Table 5 compares the branch penalties for integer-conditional branches
(far more common than floating-point branches). The penalties are
expressed as a function of the number of instructions (distance)
separating the condition-determining instruction (compare) and the branch,
and of the correctness of the prediction. The compare-to-branch
instruction count is significant only in the 601, however. Instruction
cache hits are assumed.

In the 21064, correctly predicted branches usually take no clock cycles.
They take one clock cycle when a bubble created in the swap stage is not
later squashed. The 601 has a zero-cycle branch whenever there is enough
time to finish the instruction that sets the condition code field prior to
the branch and to fetch new instructions. This may take two clock cycles:
one to execute the compare instruction, and one to fetch instructions from
the branch target. This second clock cycle may be saved when a branch is
mispredicted but is resolved before overwriting the instruction buffer;
instructions may be dispatched from the buffer right after determining
that the branch was not taken. With a two-instruction distance, the 601
has a zero-cycle branch even if it was mispredicted; the 21064 always
depends on a prediction, regardless of the distance.

The PowerPC requires fewer branch predictions in the first place; see
Table 6. In the 601, all loop-closing branches that use the CTR register
do not have to be predicted; in the Alpha these are ordinary conditional
branches, although loop-closing branches are easily predictable. A
subroutine return must read an integer register in the Alpha, so these
branches are predicted via the return stack. The PowerPC can execute
return jumps immediately in the Branch Unit; there is no need for
prediction.

Tables 5 and 6 show that accurate branch prediction is much more critical
in the 21064. Not only does the 21064 predict more of the branches, the
penalties tend to be higher when it is wrong. For this reason, the 21064
has much more hardware dedicated to the task: history bits and the
subroutine return stack. The Alpha architecture also reduces the penalty
for a misprediction by having branches that always test a register against
zero; testing one register against another would likely take an additional
clock cycle.
Some doubt the PowerPC method of using special-purpose registers for
branches because they present a potential bottleneck. We think not. These
registers allow many branches to be executed quickly without prediction
and are important for supporting loop telescoping.

Figure 3: Branch instructions. (a) Conditional branches; (b) unconditional
branches.

(a)      0 1 2 3 4 5 6 7 8 9 10 11 12 13

A        F S D I A B W
Branch   F . S D I A
B        F S D I A B W
C        F . S D I A B W

(b)      0 1 2 3 4 5 6 7 8 9 10 11 12 13

A        F S D I A B W
Branch   F . S D I A
B        F S D I X X X
C        F . S D X X X X
X        F S D I A B W
Y        F . S D I A B W

Figure 4: Timing for conditional branches in the Alpha 21064. (a) Instruction
flow for correct branch prediction; (b) instruction flow for incorrect branch
prediction. (X means instruction is flushed as a result of branch misprediction.)

Memory Architecture and Instructions

The Alpha is a 64-bit-only architecture. The PowerPC has a mode bit, and
implementations may come in either 32- or 64-bit versions; the 601 is a
32-bit version. All 64-bit versions must also have a 32-bit mode. The mode
determines whether the condition codes are set by 32- or 64-bit
operations.

The Alpha defines a flat, or linear, virtual-address space and a virtual
address whose length is implementation dependent within a specified range.
The PowerPC supports a system-wide, segmented virtual-address space in
either 32- or 64-bit mode. Differences between the two modes affect the
number of segments and their size, which also results in a difference in
the virtual-address space (52 bits versus 80 bits).

Currently, software developers and architects seem to favor flat,
virtual-address spaces, although the very large segments available in the
PowerPC shouldn't present many problems. The Alpha was defined as a 64-bit
architecture from the start, so developers can easily provide a flat
virtual-address space. The POWER architecture, however, was defined with
32-bit integer registers that were also used for addressing. This
presented the POWER architects with a dilemma: Either use a flat, 32-bit
virtual-address space (which would likely be too small in the very near
future) or encode a larger address in 32 bits. Such an encoding led to the
segmented architecture inherited by the PowerPC. Also, and perhaps more
importantly, the single, shared-address space facilitates capability-based
memory-protection methods similar to those used in IBM's AS/400 computer
systems.

Figure 5: Format for PAL instructions used to define operating-system
primitives.

The Alpha architecture specification does not define a page-table format.
Because TLB misses are handled by trapping to system software, Alpha
systems using different operating systems may have different page-table
formats. Two likely alternatives are VAX/VMS and OSF/1 UNIX. A Privileged
Architecture Library (PAL) provides an operating-system-specific set of
subroutines for memory management, context switching, and interrupts. The
Alpha instruction set includes the format in Figure 5 for PAL instructions
used to define operating-system primitives.

The Call PAL instructions are like subroutine calls to special blocks of
instructions, whose locations are determined by one of five different PAL
opcodes. A PAL routine has access to privileged instructions but employs
user-mode address translation. While in the PAL routine, interrupts are
disabled to assure the atomicity of privileged operations that take
multiple instructions. For example, if one instruction turns address
mapping off, an interrupt should not occur until another instruction can
turn it back on. The details of virtual-address translation and page-table
format are a system-software issue to be defined in the context of the
particular operating system using PAL functions.

Figure 6 compares the format of memory instructions. The format of
instructions using the displacement-addressing mode is identical in the
PowerPC and Alpha. The effective address is calculated in the same way in
both architectures except for the register with the value 0, which is
register 0 in the PowerPC and register 31 in the Alpha. There is no
indexed addressing in the Alpha. As previously mentioned, this saves a
register read port.
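
The displacement-mode calculation can be written out as a short C sketch. The function and parameter names are hypothetical, and a 64-bit register file is assumed for both machines purely to keep the model uniform (the 601 itself is a 32-bit implementation).

```c
#include <assert.h>
#include <stdint.h>

/* Effective address for register + displacement addressing.
   zero_reg is the register number that reads as zero in this mode:
   0 for the PowerPC, 31 for the Alpha (per the text). */
static uint64_t effective_address(const uint64_t regs[32], unsigned ra,
                                  int16_t d, unsigned zero_reg)
{
    assert(ra < 32 && zero_reg < 32);
    uint64_t base = (ra == zero_reg) ? 0 : regs[ra];
    /* D is sign extended prior to the addition. */
    return base + (uint64_t)(int64_t)d;
}
```

Passing 0 or 31 as `zero_reg` is the only difference between the two architectures in this model.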

Another Alpha characteristic is that load and store instructions transfer
only 32- or 64-bit data between a register and memory; there are no
instructions to load or store 8-bit or 16-bit quantities. The Alpha
architecture does include a set of instructions to extract and manipulate
bytes from registers. This approach simplifies the cache interface so that
it does not have to include byte-level shift-and-mask logic in the cache
access path.

In Example 7, the core of a strcpy routine moves a sequence of bytes from
one area of memory to another; a byte of zeros terminates the string. The
ldq_u is a load-unaligned instruction that ignores the low-order three
bits of the address; in the example, it loads a word into r1, addressed by
r4. The extract byte (extbl) instruction uses the same address, r4, but
only uses the three low-order bits to select one of the eight bytes in r1.
The byte is copied into r2. To move the byte to s, the sequence begins
with another load unaligned instruction to get the word containing the
destination byte. The mask byte (maskbl) instruction uses the three
low-order bits of r3 (the address of s) to zero out a byte in the
just-loaded r5. Meanwhile, the insert byte (insbl) instruction moves the
byte from t into the correct byte position, also using the three low-order
bits of the address in r3. The bis performs a logical OR operation that
merges the byte into the correct position, and the store unaligned (stq_u)
instruction stores the word back into s. The t and s pointers are
incremented, the byte is checked for zero, and the sequence starts again
if the byte is nonzero.
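
The extract, mask, insert, and merge steps map directly onto shifts and masks. The following C sketch models them on 64-bit values; the function names are made up for illustration, and little-endian byte numbering within the word is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Extract step: pick the byte selected by the low three address bits. */
static uint64_t extract_byte(uint64_t word, uint64_t addr)
{
    return (word >> (8 * (addr & 7))) & 0xFF;
}

/* Mask step: zero the byte selected by the low three address bits. */
static uint64_t mask_byte(uint64_t word, uint64_t addr)
{
    return word & ~((uint64_t)0xFF << (8 * (addr & 7)));
}

/* Insert step: shift a byte value into the selected byte position. */
static uint64_t insert_byte(uint64_t byte, uint64_t addr)
{
    assert(byte <= 0xFF);
    return byte << (8 * (addr & 7));
}

/* One loop iteration: move the byte at t_addr (inside t_word) into
   s_word at the byte position given by s_addr, using a logical OR
   to merge it, as the bis instruction does in the listing. */
static uint64_t copy_one_byte(uint64_t t_word, uint64_t t_addr,
                              uint64_t s_word, uint64_t s_addr)
{
    uint64_t byte = extract_byte(t_word, t_addr);
    return mask_byte(s_word, s_addr) | insert_byte(byte, s_addr);
}
```

Each byte copied this way costs two word loads and one word store, which is the price paid for keeping byte-level logic out of the cache access path.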


(a) Register + displacement addressing

    PowerPC 601    | OPCD | RT | RA | D |
                   bit positions 0, 6, 11, 16, 31
                   Effective address = (RA) + D   if RA ≠ 0
                                       D          if RA = 0

    Alpha 21064    | OPCD | RT | RA | D |
                   Effective address = (RA) + D   if RA ≠ 31
                                       D          if RA = 31

(b) Register + register (indexed) addressing

    PowerPC 601    | OPCD | RT | RA | RB | EO | Rc |
                   bit positions 0, 6, 11, 16, 21, 31
                   Effective address = (RA) + (RB)   if RA ≠ 0
                                       (RB)          if RA = 0

    Alpha 21064    Register + register addressing is not available.

Figure 6: Memory instruction format. (a) Load- and store-instruction format
using register + displacement addressing. The displacement D is sign extended
prior to addition. In Alpha, D is multiplied by 2^16 if OPCD = LDAH. RT is the
destination register. (b) Load- and store-instruction format using register +
register (indexed) addressing. RT is the destination register.



               Alpha 21064             PowerPC 601
Distance     Correct   Incorrect     Correct   Incorrect
0            0-1       4             0         2/1
1            0-1       4             0         1/0
≥2           0-1       4             0         0

Table 5: Branch penalties.



               Conditional Branches    Loop-closing          Subroutine
               (non-loop-closing)      Branches              Returns
PowerPC 601    Static prediction       Always zero-cycle     Always zero-cycle
Alpha 21064    Dynamic prediction      Dynamic prediction    Stack prediction

Table 6: Prediction methods versus branch type.


                                # A string is copied from t to s
                                # r4 points to t
                                # r3 points to s
        ldq_u   r1 = t(r4,0)    # load t, unaligned
        extbl   r2 = r1,r4      # extract byte from r1 to r2
        ldq_u   r5 = s(r3,0)    # load s, unaligned
        maskbl  r5 = r5,r3      # zero corresponding byte in r5
        insbl   r6 = r2,r3      # insert byte into r6
        bis     r5 = r5,r6      # logical OR places byte in r5
        stq_u   s(r3,0) = r5    # store unaligned
        addq    r4 = r4,1       # bump the t pointer
        addq    r3 = r3,1       # bump the s pointer
        bne     r6,LOOP         # branch if nonzero byte

Example 7: Alpha 21064 strcpy function (null-terminated strings).


Operate Instructions 

The basic operations performed by both architectures are rather similar.
One difference is the combined floating-point multiply-add in the PowerPC.
This instruction requires three floating-point register read ports. The
21064 has three such ports but uses them for stores so that a
floating-point operate can be done simultaneously with a floating-point
store; this can't be done in the 601.

The Alpha architecture does not have an integer-divide instruction; it
must be implemented in software. Leaving out integer divides, or doing
them in clever ways to reduce hardware, seems to be fashionable in RISC
architectures; however, iterative dividers are cheap, and one can expect
that all the RISC architectures will eventually succumb to divide
instructions (some already have).

The Alpha architecture has scaled integer adds and subtracts that multiply
one of the operands by 4 or 8, one of the few Alpha features that seems
non-RISCy. These instructions are useful for address arithmetic in which
indices of word or doubleword arrays are held as element offsets, then
automatically converted to byte-address values for address calculation
using the scaled add/subtracts. The PowerPC has a richer set of indexing
operations embedded in loads and stores as well as the update version of
memory instructions.
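
In C terms, a scaled add is simply the address arithmetic a compiler performs for an array subscript. The helper names below are hypothetical and stand in for the scaled-add operation described above; they are not instruction mnemonics.

```c
#include <stdint.h>

/* Scaled add by 4: element index of a 32-bit (word) array to byte address. */
static uint64_t scaled_add4(uint64_t index, uint64_t base)
{
    return (index << 2) + base;
}

/* Scaled add by 8: element index of a 64-bit (doubleword) array to byte address. */
static uint64_t scaled_add8(uint64_t index, uint64_t base)
{
    return (index << 3) + base;
}
```

Because the Alpha has no indexed addressing mode, a sequence like this followed by a displacement-mode load does the work that a single indexed load does on the PowerPC.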


Conclusion 

We have just seen that the PowerPC 601 and Alpha 21064 represent two
distinct design philosophies. The 601 implements an instruction set
containing more powerful instructions. And, it uses an implementation that
provides considerable flexibility to enhance detection and exploitation of
parallelism by the hardware. Of course, this results in more-complex
hardware control. The Alpha 21064 uses a very streamlined instruction set
and implementation. While not appearing as clever as the 601, the
simplicity of the implementation contributes to a very fast clock rate,
much faster than any other commercial microprocessor.

As a final note, follow-on processors from DEC and the PowerPC consortium,
the Alpha 21164 and PowerPC 604, continue the differing design
philosophies. The 21164 can issue more instructions per cycle than the
21064, but its pipelines are still relatively simple, and it has a very
fast clock.

The 604, on the other hand, is even more aggressive than the 601 when it
comes to providing hardware mechanisms for increasing parallelism,
although, as one would expect, this comes at the expense of hardware
control complexity.
High-Performance Programming for the PowerPC


Squeezing the best performance out of a processor requires both insight
and experience. When it comes to the PowerPC microprocessor family,
however, programmers are just starting to understand the architecture and
each processor's implementation. The PowerPC architecture specification
defines both required and optional features for any processor
implementation. Each implementation of the PowerPC-architecture
specification (the 601, 603, 604, and 620) may have a very different set
of features. For example, cache size and type, bus width, power-management
capabilities, and number of execution units can vary between each part.
However, compliance with the PowerPC architecture specification ensures
binary compatibility across each processor implementation.

In this article, we'll examine methods for measuring performance and
techniques for improving the speed and efficiency of applications running
under Windows NT for PowerPC. In doing so, we'll point out some of the
pitfalls associated with Little-endian operating systems (such as Windows
NT) and present an application that demonstrates the effect of byte
alignment on performance. We'll also look at some optimization techniques
that apply more generally to the PowerPC architecture.

Avoid performance pitfalls when coding for Windows NT

Kip McClanahan, Mike Phillip, and Mark VandenBrink

Kip works in Motorola's RISC software group writing low-level PowerPC
software and is the author of PowerPC Programming for Intel Programmers
(IDG Books, 1995). He can be contacted at
kip_mcclanahan@risc.sps.mot.com. Mark has been hacking operating-system
kernels for over ten years and is currently on the team responsible for
the PowerPC port of Windows NT. He can be contacted at
markv@risc.sps.mot.com. Mike is manager of the compiler and tools
development group at Motorola in Austin, Texas. He can be contacted at
phillip@risc.sps.mot.com.

Measuring Performance 

Perhaps the most difficult step in improving performance is simply getting
started. There are several ways to analyze performance for a particular
application, but it's almost always necessary to narrow the scope of the
analysis to factors that can be controlled by the programmer. Performance
is typically affected by:

* System hardware.

* System software.

* Application design/algorithms.

* Compiler/tools/build configuration.

System hardware issues include the size and speed of memory and disk
subsystems, the type of video cards and the amount of dedicated video
memory, and the type and speed of the microprocessor itself. For most
developers, it is important to characterize the impact of the system
hardware and software on performance, but most opportunities to improve
performance lie in how the application itself was built. System-software
issues are too numerous to list, but tend to center around the
operating-system and networking configuration of the computer.

Most applications are too large to examine in their entirety, but
performance-analysis tools such as profilers can often identify those
regions of code in which most time is spent. If profiling tools are not
available, you can gain insight into any possible performance bottlenecks
by thoroughly inspecting the code and answering the following questions:

* Does the application access a lot of disk-based data?

* Are the program data types integer, floating point, or both?

* Does the program frequently access large blocks of memory (such as
arrays of data) or frequently allocate memory?

Improving Performance 

Regardless of profiling information, a good place to start improving
performance is to turn on compiler optimizations. Most modern compilers
offer sophisticated code optimization that can be invoked by the user.
Although such options will likely slow down compilation, the resulting
application should execute much faster, often with speedups of 200 to 300
percent. Compiler optimization controls are akin to stereo-system
controls, however: Just as increasing the volume to its maximum level may
not be optimal for each piece of music, simply setting the default
optimization flag is unlikely to yield optimal performance for every
software application. Many compilers offer specific optimization flags for
fine-tuning application performance. While such optimizations may not
apply to enough applications to warrant inclusion in the default
optimization settings, they can greatly improve the performance of a
particular application.

One such option is automatic inlining of subroutines. Although not much of
an optimization by itself, subroutine inlining exposes more code to the
optimizer in the context in which it will be used. Of course, blindly
inlining code is unwise, because replicating the subroutine body increases
code size, which can actually decrease performance through loss of code
locality in the memory system. However, when used in conjunction with
profiling feedback, inlining small, heavily called routines often improves
performance significantly for the overall application.

While selectively inlining C or C++ subroutines may improve performance,
inline assembly code should be embedded with caution. When assembly code
is inlined into a high-level program, the compiler must typically make
conservative assumptions about register and memory usage, which can
throttle many potential optimizations. But inlining critical PowerPC
assembly instructions (such as synchronization primitives or
status-register accesses) can improve performance. Some compilers,
including those developed by Motorola, provide a set of built-in intrinsic
functions that provide efficient access to low-level system instructions
without incurring the performance penalty associated with inlining
seemingly "random" assembly instructions.

Increased awareness of the interactions between the operating system and
the microprocessor is critical to performance

Language and Design Considerations 

The most significant factor affecting software performance is the design
and implementation of the application itself. While algorithm design is
application specific and clearly beyond the scope of this discussion,
several general design considerations affect performance for virtually any
PowerPC software application.

Stick to the standards. Languages such as C and C++ have well-defined
standards intended to ensure the portability of source code among
different development environments. Many providers of development tools
offer nonstandard language extensions that differentiate their products.
While often alluring to the developer, these extensions can easily tie an
application to a particular tool set, and, to a lesser extent, a
particular target architecture. Many of these features can significantly
degrade performance on RISC microprocessors like a PowerPC.

Watch out for misalignment. Examples of such language extensions are the
__unaligned keyword and #pragma pack, both of which can affect data
alignment. For Little-endian implementations of an operating system such
as Windows NT, misaligned accesses can lead to significant performance
losses on the PowerPC architecture. Many modern microprocessors, including
most PowerPC implementations, are optimized to handle properly aligned
memory references, often at the expense of handling relatively infrequent
misaligned references. Compilers typically will align data on its
"natural" alignment boundary, where the address of the object is an exact
multiple of the size of the object in bytes. However, certain programming
practices, including the imprudent use of some language extensions, can
create a situation where the compiler must access memory in chunks smaller
than the natural size of an object. For example, if you use the
__unaligned keyword in Microsoft C/C++ to inform a compiler that an object
is misaligned, a PowerPC compiler will typically be required to load the
corresponding object from memory one byte at a time. For a 32-bit integer
object, this requires four memory accesses instead of one, plus three
additional rotate instructions.
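
Written out by hand in C, a byte-at-a-time load of a 32-bit little-endian integer looks roughly like the sketch below. The function name is hypothetical, and the shifts and ORs stand in for the rotate-and-merge instructions a compiler would emit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load a 32-bit little-endian integer from an address that may be
   misaligned: four single-byte loads, merged with three shifts. */
static uint32_t load_u32_bytewise(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The result is correct at any address, but every access pays the four-load cost whether or not the data happens to be aligned.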

A similar situation can arise when the compiler is instructed to "pack"
the elements of a structure via source-code pragmas. On older
architectures, including the Intel 80x86 family, such language assertions
do not necessarily affect performance. However, given the relatively high
cost of servicing an alignment exception, PowerPC compilers will typically
opt to generate conservative, albeit slower, code for known, potentially
misaligned accesses. Removing the __unaligned keyword or structure-packing
pragmas does not necessarily solve the problem. In fact, it may lead to
incorrect code or even worse performance. To avoid these performance
pitfalls, it's best to avoid such extensions when designing an
application. The alignment example shown in Listing One (listings begin on
page 38) demonstrates the performance penalties associated with the three
common techniques for misalignment resolution.
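
To see how packing moves a field off its natural boundary, consider the following sketch. The structure and function names are invented for illustration, and the offsets in the comments assume a typical compiler where a 32-bit integer has 4-byte natural alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Default layout: the compiler pads after 'tag' so that 'value'
   sits on its natural 4-byte boundary. */
struct natural_record {
    uint8_t  tag;
    uint32_t value;
};

/* Packed layout: no padding, so 'value' starts at offset 1. */
#pragma pack(push, 1)
struct packed_record {
    uint8_t  tag;
    uint32_t value;
};
#pragma pack(pop)

static size_t natural_value_offset(void)
{
    return offsetof(struct natural_record, value);
}

static size_t packed_value_offset(void)
{
    return offsetof(struct packed_record, value);
}
```

The packed structure saves three bytes per record, but every access to `value` becomes a potentially misaligned access that the compiler must treat conservatively.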

Alignment exceptions can also be created through mismanagement of pointers
in C and C++ programs, or by accessing data through a reference that is
not of the same natural alignment as the original object; for example,
accessing a series of characters or half-word objects through integer
variables. For Windows NT applications, these misalignment exceptions are
often hidden from the user through programmer instructions to the
operating system. As the alignment example in this article demonstrates,
it is clearly preferable to enable exceptions as a means of locating
performance losses than to "hide" them from the user.

Structured Exception-Handling 
Issues 

Structured exception handling can also adversely affect performance. While
exception handlers are an elegant and maintainable means of managing the
interaction between the application and the operating system when errors
or unexpected events occur, they can also restrict the compiler's ability
to safely optimize code. This is true even for code not directly included
in the exception handler itself. Thus, exception handlers should be
carefully designed to minimize the performance impact, as follows:

* Isolate the exception-handling code as much as is practical. Since the
flow of program control to an exception handler and its corresponding
effect on optimization is often nonintuitive, exception handlers should
not be placed in the middle of large, otherwise unrelated subroutines,
particularly if they are performance critical. Since most compilers
optimize a subroutine at a time, isolating the actual exception handler
within a small subroutine (that is perhaps called by a larger enclosing
subroutine) can help limit performance degradation.

* Avoid introducing pointers and global variables within
exception-handling code. The semantics of most exception-handling language
constructs typically necessitate saving and restoring "live" data across
the scope of an exception handler, since the compiler often does not know
which data an exception will affect. By limiting the exposure of global
variables and pointers within an exception handler, the compiler can often
better reduce memory accesses within the corresponding region of code.

Target-Specific Issues 

Target-specific factors can often be utilized to maximize performance.
Although fine-tuning for one processor or architecture at the expense of
others should be avoided, a few PowerPC-specific issues should be
considered to maximize PowerPC performance.


Memory Subsystem 

Alignment Size                              Form of 32-bit Address
8-bit                                       xxxx xxxx
16-bit integers                             xxxx xxx0
32-bit integers and single-precision FP     xxxx xx00
64-bit integers and double-precision FP     xxxx x000

Table 1: Natural alignment boundaries.

The frequency and speed of memory accesses are almost always key factors
in overall application performance. Compared to most PowerPC-processor
operations, off-chip memory references are extremely expensive
(particularly if the accesses are misaligned). You can often maximize the
utilization of on-chip and secondary off-chip caches by carefully managing
large data structures such as arrays. Since all PowerPC chip
implementations utilize associative cache designs, a slight change in the
"stride" of array accesses can significantly affect the effectiveness of a
cache. A simple guideline is to avoid array sizes that are an exact
multiple of the cache size (typically powers of 2 in the range of 8-64
KB). However, predicting cache behavior through such simple guidelines is
precarious, at best, since actual dynamic reference patterns vary between
applications. The guidelines' intent is to avoid allocating heavily
referenced variables to the same set of cache addresses. Profiling tools
can often provide feedback concerning a program's cache-utilization
efficiency.
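
The effect of stride on cache-set usage can be illustrated with a small model. Everything here is an assumption made for the sake of the example: the line size, the number of sets, the simple modulo mapping, and the function names. Real PowerPC caches differ in geometry and associativity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SETS 1024u   /* upper bound for this model only */

/* Hypothetical mapping: line number modulo number of sets. */
static size_t cache_set(size_t addr, size_t line_bytes, size_t num_sets)
{
    return (addr / line_bytes) % num_sets;
}

/* Count how many different cache sets are touched by 'accesses'
   references that are 'stride_bytes' apart, starting at address 0. */
static unsigned distinct_sets_touched(size_t stride_bytes, unsigned accesses,
                                      size_t line_bytes, size_t num_sets)
{
    bool seen[MAX_SETS] = { false };
    unsigned distinct = 0;

    assert(line_bytes > 0 && num_sets > 0 && num_sets <= MAX_SETS);
    for (unsigned i = 0; i < accesses; i++) {
        size_t set = cache_set((size_t)i * stride_bytes, line_bytes, num_sets);
        if (!seen[set]) {
            seen[set] = true;
            distinct++;
        }
    }
    return distinct;
}
```

With 32-byte lines and 128 sets, sixteen accesses at a 4096-byte stride all land in one set, while padding the stride by a single line (4128 bytes) spreads them over sixteen sets.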

A Real-World Alignment Example 

To make the following alignment example more meaningful, we'll tie it to
an operating system. Because Windows NT for the PowerPC is a Little-endian
operating system, it is subject to the alignment restrictions described
earlier. Remember, a multibyte access performed at an address not aligned
with the size of the access (known as "natural alignment") will cause an
alignment exception. Table 1 shows the natural-alignment boundaries for
memory accesses of various sizes.
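
The rule in Table 1 amounts to a check on the low-order address bits. A minimal C sketch, with a hypothetical function name and assuming access sizes that are powers of two:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if 'addr' is an exact multiple of 'size', that is, the access
   is naturally aligned. 'size' must be a power of two (1, 2, 4, 8). */
static bool is_naturally_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}
```

A check like this can be dropped into debug builds around suspect pointer casts to find misaligned accesses before they reach the exception handler.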

When a Little-endian PowerPC application performs a memory access that is
not aligned on its natural boundary, an alignment exception will result.
When alignment issues exist on PowerPC-based systems, they can be resolved
in three ways:

* If there is no support for data-alignment management, the operating
system traps on the alignment exception, usually terminating the
"faulting" application. This may seem the worst of all possible outcomes,
but it can be very helpful during the development cycle; see the
discussion that follows.

* The operating system can take the alignment exception, perform the
necessary fix-ups to transparently handle the memory access, and return as
if nothing had ever happened. And while abstracting the problem from both
user and programmer may seem like the best solution, it is one of the
worst. The transition through the alignment-exception mechanism is
comparably slow and one of the worst performance killers.

* You can use the __unaligned type qualifier, the #pragma pack(1)
directive, or macros on accesses with known alignment problems. Each of
these techniques breaks a single multiple-byte access into individual,
byte-wise accesses in order to eliminate alignment issues.

Trap and Terminate 

Under Windows NT, it is possible (and the default for PowerPC Windows NT) not to have any support for misaligned data. When there is no support for misalignment and your application accesses data on an unnatural boundary, an alignment fault is generated. The Windows NT alignment fault handler is configured not to fix misaligned accesses and will display a pop-up message and terminate your application. This seems like the one situation to avoid, but in fact the operating system is doing you a favor.

The ability to trap an alignment exception and terminate the faulting application is valuable when porting code, particularly when porting Windows NT applications from 80x86 to PowerPC architectures. And the ability to detect alignment exceptions is fundamental to handling misalignment efficiently.

Operating System Fix-ups 

The Windows NT kernel can be configured to perform misaligned data fix-ups upon detection of an alignment exception. This means that the alignment-exception handler must break the multibyte memory access that caused the exception into individual byte accesses, which are not constrained by alignment issues. Although this sounds like a terrific service, it is the most inefficient solution. Even if the rest of your application is well constructed, a few OS-based alignment fix-ups can bring performance to its knees. As Table 2 shows, for 5 million misaligned memory references, the difference between OS handling and programmatic handling is nearly 30 seconds! Put another way, having the OS fix up misaligned memory accesses is 43 times slower than the same number of aligned accesses.

Table 2: Timing values generated by Listing One. Sample tests performed on a 100-MHz 604 running Windows NT 3.51 (build 1057), compiled with the Motorola NT compiler and averaged over six runs of the program.

If this solution is so slow, why is it around? Of course, you wouldn't want your released and shipping applications to terminate the first time they have an unexpected alignment fault; compatibility with 80x86 applications would be reduced significantly. OS-based fix-ups are a reasonable first-pass solution for PowerPC applications that have not specifically taken data alignment into consideration. Just as the number of 32-bit applications is slowly increasing in the 80x86 world, so will the number of PowerPC applications that make data alignment a priority.

Processes can enable OS-based alignment fix-ups using the call SetErrorMode(SEM_NOALIGNMENTFAULTEXCEPT). A child process inherits its parent's error mode, so any processes created by your application after enabling this mode will also suffer the performance penalty for misaligned data. Under Windows NT for PowerPC and MIPS processors, OS-based misalignment support is disabled by default. The Alpha version of NT enables OS-based misalignment support by default; Alpha NT applications must turn this feature off if programmatic solutions are used. Setting the SEM_NOALIGNMENTFAULTEXCEPT flag has no effect on x86 processors.

Programmatic Solutions 

The first programmatic solution to data misalignment is the __unaligned pointer type qualifier. When the compiler sees a pointer reference (this qualifier works only with pointers) declared using the __unaligned type qualifier, it includes sufficient code to ensure that the memory access does not generate an alignment exception. In particular, it breaks multiple-byte accesses into individual byte accesses. A single alignment-constrained, 32-bit load or store instruction is replaced by seven instructions that perform the same operation using only byte accesses. Similarly, a single 16-bit memory reference would be replaced by three instructions. Listing Two shows a single, 32-bit, PowerPC store-word instruction. If the store-word instructions in Listings One and Two were performed using an __unaligned pointer qualifier reference, the instruction would be converted to the seven instructions shown in Listing Three.

Listing Three represents compiler-generated code and, taken out of the context of the original flow of code, may appear suboptimal. Figure 1 depicts the operation of the seven instructions of Listing Three. The first instruction stores the low-order byte (0x78) of r3 into the address contained in r4. The rlwinm instruction is used to rotate the bytes within r3 such that each subsequent store-byte instruction references the proper value. The value contained in r3 is in Big-endian format, and r4 points to Little-endian memory. Therefore, the bytes must be swapped into Little-endian format during the store operation.
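The same byte-by-byte store can be expressed in portable C. The sketch below is not compiler output and does not use the __unaligned qualifier; the function name is an assumption chosen for illustration. It writes a 32-bit value to any address, aligned or not, least significant byte first.

```c
#include <stdint.h>

/*
 * Illustrative only: store a 32-bit value one byte at a time in
 * Little-endian order. Byte stores have no alignment restriction,
 * so dst may be any address.
 */
void store_le32_bytewise(uint8_t *dst, uint32_t value)
{
    dst[0] = (uint8_t)(value & 0xFFu);          /* low-order byte */
    dst[1] = (uint8_t)((value >> 8) & 0xFFu);
    dst[2] = (uint8_t)((value >> 16) & 0xFFu);
    dst[3] = (uint8_t)((value >> 24) & 0xFFu);  /* high-order byte */
}
```

With a value of 0x12345678, the four bytes written are 0x78, 0x56, 0x34, and 0x12, which matches the order in which the store-byte instructions of the listing place them in memory.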

Another programmatic solution, the #pragma pack() directive, is particularly appropriate for porting between the various Windows NT platforms. One potentially recurring problem results from data structures and formats that were never designed from a portability perspective. In particular, graphics formats such as BMP and DIB do not address natural-boundary alignment. This is understandable: When these file formats were created for 80x86 software (such as Microsoft Windows), alignment issues were not a big concern. Alignment was nice, but misaligned accesses didn't kill performance. With the advent of RISC processors, fixed-length instruction size, and the associated alignment restrictions, data misalignment has become an important issue.

The #pragma pack() directive tells the compiler two things. First, pack structure elements as close together as possible. In particular, avoid using the standard (aligned) structure padding. Second, the compiler knows to generate additional code to support misaligned accesses for elements within the “packed” structure, much like the effect of the __unaligned qualifier. In Listing One, a standard BMP header is packed, and the misaligned access is performed using the 32-bit BMPDataOffset element. This addresses application portability concerns because it allows the programmer to guarantee that the offsets within a native PowerPC structure will exactly match those defined in a native 80x86 structure.

[Figure 1: Multi-byte store into Little-endian memory. The diagram shows general-purpose register r3 (Big-endian format), from most significant byte to least significant byte.]

#define rULONG(x)  (ULONG)(                  \
        *(UCHAR *)(&x)               |       \
        (*((UCHAR *)(&x)+1) << 8)    |       \
        (*((UCHAR *)(&x)+2) << 16)   |       \
        (*((UCHAR *)(&x)+3) << 24) )

#define rUSHORT(x) (USHORT)(                 \
        *(UCHAR *)(&x)               |       \
        (*((UCHAR *)(&x)+1) << 8) )

Figure 2: Macros to break an access into bytes.

-0 Use ONLY aligned accesses.
-1 NO alignment fix ups (causes an exception).
-2 Use OS-based fix ups for misaligned accesses.
-3 Use _UNALIGNED type qualifier.
-4 Use #PRAGMA PACK(1) directive.

Figure 3: Options for the second parameter to the ALIGN program.

Finally, there may be well-defined times when you know that you're about to perform a misaligned access. Instead of permanently packing a structure or declaring an __unaligned pointer variable to reference the memory, you can use the macros shown in Figure 2 to break the particular access into its byte components. To use the macros, simply bracket each misaligned memory reference with the appropriate macro.
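To show what bracketing a reference looks like, here is a small sketch that reads the 32-bit data-offset field at byte offset 10 of a BMP header held in a raw byte buffer. It is a sketch under stated assumptions: the typedefs are portable stand-ins for the Windows types, the macro is a fully parenthesized variant of the Figure 2 macro, and read_bmp_data_offset is a hypothetical name.

```c
#include <stdint.h>

/* Portable stand-ins for the Windows types (assumption for this sketch). */
typedef uint8_t  UCHAR;
typedef uint32_t ULONG;

/* Parenthesized variant of the Figure 2 macro: assemble a 32-bit
 * Little-endian value from four single-byte reads. */
#define rULONG(x) ((ULONG)(                          \
      (ULONG)*((UCHAR *)&(x))                        \
    | ((ULONG)*((UCHAR *)&(x) + 1) << 8)             \
    | ((ULONG)*((UCHAR *)&(x) + 2) << 16)            \
    | ((ULONG)*((UCHAR *)&(x) + 3) << 24)))

/* Hypothetical helper: the field at offset 10 is not 4-byte aligned,
 * so it is read through the macro instead of a direct ULONG load. */
ULONG read_bmp_data_offset(UCHAR *file_image)
{
    return rULONG(file_image[10]);
}
```

Only the one reference that is known to be misaligned pays the byte-wise cost; every other access in the program can remain an ordinary aligned load or store.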

The Misalignment Demonstration Program

Three categories of misalignment resolution are demonstrated in the alignment program (ALIGN.C) shown in Listing One. This program lets you compare the performance impact of misaligned data by timing both aligned and misaligned accesses.

When timing a fixed number of operations in a preemptive, multitasking operating system, such as Windows NT, it is important to minimize noise in your timing measurements. To do so, elevate your thread to the highest priority possible and take multiple data samples. Listing One sets the thread priority to THREAD_PRIORITY_TIME_CRITICAL. This increases the accuracy of the misalignment timing by reducing the number of interrupts that can influence the overall time required to complete the set of operations. In fact, when running ALIGN, the usefulness of your mouse will decrease dramatically.

To obtain the timing values, the GetTimeStamp() routine simply uses QueryPerformanceFrequency() and QueryPerformanceCounter() to sample the Win32 high-frequency counter. The time reported by ALIGN is derived by taking the difference between the time stamp before and after each set of memory-access operations.
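The arithmetic behind the reported time can be sketched without any Win32 calls. The function below is a hypothetical helper, not code from the listing: it takes two raw counter readings and the counter frequency in ticks per second, and returns elapsed milliseconds.

```c
#include <stdint.h>

/*
 * Illustrative helper (names are assumptions): convert two raw counter
 * samples into elapsed milliseconds, given the counter frequency in
 * ticks per second. Returns 0 if the counter ticks fewer than 1000
 * times per second, since millisecond resolution is then unavailable.
 */
uint64_t ticks_to_ms(uint64_t start_ticks, uint64_t end_ticks,
                     uint64_t ticks_per_sec)
{
    uint64_t ticks_per_ms = ticks_per_sec / 1000u;

    if (ticks_per_ms == 0)
        return 0;
    return (end_ticks - start_ticks) / ticks_per_ms;
}
```

Subtracting the raw samples before dividing, as done here, is a design choice of this sketch; it avoids rounding each time stamp separately.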

ALIGN requires two parameters: an iteration count and one of the parameters in Figure 3. For example, the staggering 28-second measurement was generated using ALIGN 5000000 -2, which performed five million misaligned accesses using the OS to handle the alignment fix-ups. Listing One was used to generate the timing values shown in Table 2, which clearly demonstrates the cost of data misalignment.

Conclusion 

While the increased availability of performance analysis tools and continual advancement of compiler optimization techniques will accelerate the tuning process, many key performance issues will remain embedded within the design and implementation of an application. An increased awareness among PowerPC developers of the interactions between the operating system and the microprocessor is critical to avoiding performance losses due to misaligned memory references, poor utilization of structured exception handling, and inefficient application of compiler optimizations.

For Little-endian-system implementations such as Windows NT, the means by which alignment issues are resolved can dominate application performance, as demonstrated in the ALIGN example of Listing One. Most importantly, many of the concepts in this article affect performance not only for PowerPC systems, but for other architectures, as well.

HIGH PERFORMANCE


Listing One

// Windows NT for PowerPC Alignment Demonstration Program
//
// Mark VandenBrink, markv@risc.sps.mot.com
// Kip McClanahan, kip_mcclanahan@risc.sps.mot.com -or- kip@io.com
// Mike Phillip, phillip@risc.sps.mot.com

#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <windows.h>
#include <winioctl.h>
#include <string.h>
#include <ctype.h>
#include <memory.h>

// Force compiler fix-ups for data accesses within the
// BMP structure by using #pragma pack(1).
#pragma pack(1)

// Standard Windows 3.x BMP file header format
//
struct BMPHeader {
    USHORT FileType;        // offset 0
    ULONG  FileSize;        // offset 2
    USHORT reserved1;       // offset 6
    USHORT reserved2;       // offset 8
    ULONG  BMPDataOffset;   // offset 10
};

struct BMPHeader bmpBuffer; // declare structure

//
// Print an error message to the screen and exit.
//
static VOID
Die(char *format, ...)
{
    va_list va;

    va_start(va, format);
    fprintf(stderr, "\n\nALIGN: ");
    vfprintf(stderr, format, va);
    va_end(va);
    ExitProcess(2);
}

//
// Return a timestamp from the high frequency performance counters (if
// one exists). Return the stamp in units of number of milliseconds.
//
static UINT
GetTimeStamp(VOID)
{
    static DWORD FreqInMs = 0;
    LARGE_INTEGER Time;

    if (!FreqInMs) {
        if (QueryPerformanceFrequency(&Time) == TRUE) {
            if (Time.HighPart) {
                Die("Timer has too high a resolution\n");
            }
            //
            // Convert counts per second to counts per millisecond
            //
            FreqInMs = Time.LowPart / 1000;
        } else {
            Die("Could not get frequency of performance counter\n");
        }
    }
    if (QueryPerformanceCounter(&Time) == FALSE) {
        Die("System does not support high-resolution timer\n");
    }
    return Time.LowPart / FreqInMs;
}

//
// Essentially useless function that returns a value to place at
// IntPointer. Function used to prevent compiler from optimizing
// away references to *IntPointer inside a loop.
//
DWORD
GetNextValue(VOID)
{
    static DWORD NextValue = 0;
    return NextValue++;
}

int
main(int argc, char **argv)
{
    CHAR Buffer[1024];
    UINT EndTime;
    UINT Max = 0;
    UINT StartTime;
    UINT i;
    struct BMPHeader *headerPtr;
    int *IntPointer2;
    __unaligned int *IntPointer;

    switch (argc) {
    case 3:
        //
        // Note: setting thread's priority to THREAD_PRIORITY_TIME_CRITICAL
        // can effectively bring a machine to its knees, depending on the
        // process priority class.
        //
        SetThreadPriority(GetCurrentThread(),
                          THREAD_PRIORITY_TIME_CRITICAL);

        Max = strtoul(argv[1], 0, 0);

        //
        // The naturally aligned case
        //
        if (argv[2][0] == '-' && argv[2][1] == '0') {
            printf("ONLY aligned references\n");
            IntPointer2 = (int *)(&Buffer[4]);
            printf("Buffer at %x, IntPointer = %x\n", Buffer, IntPointer2);
            StartTime = GetTimeStamp();
            for (i = 0; i < Max; i++) {
                *IntPointer2 = GetNextValue();
            }
            EndTime = GetTimeStamp();
            break;
        }

        // The no fix-ups, alignment exception causing case
        if (argv[2][0] == '-' && argv[2][1] == '1') {
            printf("NO support for misaligned references\n");
            IntPointer2 = (int *)(&Buffer[3]);
            printf("Buffer at %x, IntPointer = %x\n", Buffer, IntPointer2);
            StartTime = GetTimeStamp();
            for (i = 0; i < Max; i++) {
                *IntPointer2 = GetNextValue();
            }
            EndTime = GetTimeStamp();
            break;
        }

        // OS-based fix-ups
        //
        if (argv[2][0] == '-' && argv[2][1] == '2') {
            printf("OS support of misaligned references\n");
            SetErrorMode(SEM_NOALIGNMENTFAULTEXCEPT);
            IntPointer2 = (int *)(&Buffer[3]);
            printf("Buffer at %x, IntPointer = %x\n", Buffer, IntPointer2);
            StartTime = GetTimeStamp();
            for (i = 0; i < Max; i++) {
                *IntPointer2 = GetNextValue();
            }
            EndTime = GetTimeStamp();

Dr. Dobb’s Sourcebook, September/October 1995
            break;
        }

        // Compiler fix-ups through the _UNALIGNED pointer qualifier
        if (argv[2][0] == '-' && argv[2][1] == '3') {
            printf("Using _UNALIGNED qualifier\n");
            IntPointer = (int *)(&Buffer[3]);
            printf("Buffer at %x, IntPointer = %x\n",
                   Buffer,
                   IntPointer);
            StartTime = GetTimeStamp();
            for (i = 0; i < Max; i++) {
                *IntPointer = GetNextValue();
            }
            EndTime = GetTimeStamp();
            break;
        }

        // Compiler fix-ups through the packed BMP header structure
        if (argv[2][0] == '-' && argv[2][1] == '4') {
            headerPtr = (struct BMPHeader *)Buffer;
            printf("Using #pragma pack(1) directive\n");
            printf("Access offset %x\n", (ULONG)&(headerPtr->BMPDataOffset));
            StartTime = GetTimeStamp();
            for (i = 0; i < Max; i++) {
                headerPtr->BMPDataOffset = GetNextValue();
            }
            EndTime = GetTimeStamp();
            break;
        }
        //
        // fall through
        //
    default:
        fprintf(stderr, "Usage: ALIGN number-of-iterations [-option]\n");
        fprintf(stderr, "\twhere option is one of the following:\n");
        fprintf(stderr, "\t-0 Use ONLY aligned accesses.\n");
        fprintf(stderr, "\t-1 NO alignment fix ups (causes an exception).\n");
        fprintf(stderr, "\t-2 Use OS-based fix ups for misaligned accesses.\n");
        fprintf(stderr, "\t-3 Use _UNALIGNED type qualifier.\n");
        fprintf(stderr, "\t-4 Use #PRAGMA PACK(1) directive.\n");
        ExitProcess(0);
    }

    printf("%d milliseconds\n", EndTime - StartTime);
    ExitProcess(0);
}


Listing Two

; Typical 32-bit store instruction
; Assumes:
;   r3 contains word to store at address contained in r4

        stw     r3, 0(r4)           ; store word contained in r3
                                    ; to address contained in r4 + 0

Listing Three

; The equivalent 32-bit store resulting from use of the
; __unaligned type qualifier in the pointer declaration
; for IntPointer.
; Assumes:
;   r3 contains word to store at address contained in r4
;   For the purposes of this example, assume that
;   r3 = 0x12345678.

        stb     r3, 0(r4)           ; store the lower byte (0x78)
                                    ; of r3 into address contained
                                    ; in r4 + 0.
        rlwinm  r5, r3, 24, 8, 31   ; extract bits 16-23 into the
                                    ; low-order position of r5
; How the rlwinm instruction works:
;   Step 1: rotate contents of r3 left by 24 bits
;           Result: 0x78123456
;   Step 2: generate a mask with 1-bits from
;           bit 8 to 31. Result: 0x00FFFFFF
;   Step 3: AND the rotated contents of r3 with the mask and
;           place the result into r5.
;           Result: r5 = 0x00123456
;   NOTE: the next stb instruction will store
;         0x56 into the address (r4 + 1).
;         See Figure 1.
        stb     r5, 1(r4)           ; store next byte at r4 + 1
        rlwinm  r5, r3, 16, 16, 31  ; extract bits 8-15 into r5
        stb     r5, 2(r4)           ; store next byte at r4 + 2
        rlwinm  r3, r3, 8, 24, 31   ; extract bits 0-7 into r3
        stb     r3, 3(r4)           ; store final byte at r4 + 3


End Listings

C MACROS

Bit Operations with C Macros

Endian refers to a processor addressing model that defines the byte ordering of data and instructions stored in computer memory. The most common addressing models are Big-endian (left-to-right order) and Little-endian (right-to-left). Intel-based processors (80x86, Pentium, and the like) are Little-endian, while others (such as the Motorola 680x0 in the Macintosh) are Big-endian. Still others, particularly the PowerPC, are “Bi-endian,” allowing them to run in either Big-endian or Little-endian mode (see the accompanying text box entitled “PowerPC Bi-Endian Capabilities,” by James R. Gillig).

As straightforward as this sounds, “endianness” can be confusing for programmers, particularly when developing portable software running on a variety of platforms. To address this confusion, I've developed an “endian engine” which handles every byte order. This engine is presented in my article “Your Own Endian Engine” (Dr. Dobb's Journal, November 1995). The heart of the engine is a powerful set of C macros that perform bitwise operations, which I'll discuss in this article. It's noteworthy that the examples I implement to handle these C macros are designed to handle instructions for MMIX, a hypothetical computer developed by Donald Knuth (see the accompanying text box entitled “MMIX: Knuth's New Computer”).

John is a programmer in the Seattle area. He can be contacted on CompuServe at 72634,2402.

And Knuth's MMIX as a bonus!

John Rogers

I invented some of the macros myself and modeled others on Fortran functions. Together, they comprise a complete set of macros for manipulating bits. Since ANSI C says the value of a right-shifted negative number is “implementation-defined,” I've defined all of the macros for operands with unsigned integral types, just to be on the safe side. Listing One, bitops.h, has all of the C macros discussed in this article; listings begin on page 44. The complete source code to accompany this article is available electronically; see “Availability,” page 3.

Except for MVBITS, all of the macros in bitops.h return values rather than updating parameters; see, for instance, the ALL_ZERO_BITS(type) macro in Example 1. The companion macro, ALL_ONE_BITS(type), is analogous.
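Since bitops.h itself is not reproduced here, the following is only a guess at how a pair of value-returning macros of this kind might be written. The _SKETCH names are deliberately different from the article's macros, and the definitions are assumptions, not the author's code.

```c
#include <limits.h>   /* UCHAR_MAX and friends, handy when checking results */

/* Assumed definitions, for illustration only. Each macro takes an
 * unsigned integral type name and yields a value of that type. */
#define ALL_ZERO_BITS_SKETCH(type) ((type)0)
#define ALL_ONE_BITS_SKETCH(type)  ((type)~(type)0)
```

The outer cast in the second macro matters for types narrower than int: the complement is computed after integer promotion, so the result has to be converted back to the requested type.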

Bit Numbering 

Many of the macros here use bit numbering. By convention, the least significant bit (LSB) is bit 0. Some of the macros indicate one or more contiguous bits in a value, using the convention of a start-bit number and a length in bits. You give the bit number of the lowest bit in the range of bits that you want. For instance, to use bits 0 through 3, give a start-bit number of 0 and a length of 4.
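The start-bit and length convention maps directly onto a mask. The helper below is a hypothetical illustration (it is not part of bitops.h) that builds the mask for a bit range in an unsigned long, with bit 0 as the LSB.

```c
#include <limits.h>

/*
 * Illustrative helper (name is an assumption): mask with len one-bits
 * whose lowest bit is bit number start, where bit 0 is the LSB.
 */
unsigned long bit_range_mask(unsigned start, unsigned len)
{
    const unsigned width = (unsigned)(sizeof(unsigned long) * CHAR_BIT);
    unsigned long ones;

    if (len == 0 || start >= width)
        return 0UL;
    /* Shifting by the full width is undefined in C, so handle it apart. */
    ones = (len >= width) ? ~0UL : ((1UL << len) - 1UL);
    return ones << start;
}
```

A start-bit number of 0 and a length of 4 gives 0xF, the mask for bits 0 through 3; a start of 8 and a length of 4 gives 0xF00.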

Conveniently, a variety of specifications and standards adhere to the LSB convention of numbering as bit 0. Most Intel processors, the IEEE MUFOM (microprocessor universal format for object modules), and the MIL-STD FORTRAN functions all use this convention. The only major exception is IBM mainframes, which number the most significant bit (MSB) as bit 0.

Fortran-Inspired Macros 

Since at least the 1970s, many versions of Fortran have included a standard set of bit functions that include the usual operations: AND, OR, NOT, and exclusive OR. These Fortran functions also include routines for bit extraction, insertion, shift, and circular shift. Rather than reinvent the wheel, I've used the same function names and operand orders. However, since the Fortran functions were designed for implementations with just one integer type, and C has many sizes of integer types (ranging from char to long), I added, where necessary, an additional parameter (at the end) to indicate the data type to be returned. This must be some unsigned integral type.

As for the Fortran-inspired macros, I'll start with NOT(value, type) (bitwise complement), sometimes called the “flip bits” or “invert” operation. In Fortran, the NOT(value) function has one operand (an integer value) and returns an integer, the inverted value. The C NOT(value, type) macro has an additional operand, which must be an unsigned integral data type. The NOT(value, type) macro converts the given value to the given type and returns the converted value with all of the bits inverted. For instance, in an implementation with 8-bit characters, NOT(0xF0, unsigned char) would be 0x0F.
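One plausible way to write such a macro is shown below. This is a sketch under the assumption that the macro simply casts before and after the complement; the actual definition in bitops.h may differ, and the _SKETCH name is used to make that clear.

```c
#include <limits.h>   /* USHRT_MAX etc., handy when checking results */

/* Assumed definition, for illustration only: convert value to type,
 * invert every bit, and convert the promoted result back to type. */
#define NOT_SKETCH(value, type) ((type)~(type)(value))
```

Without the outer cast, ~(unsigned char)0xF0 would be evaluated as an int and would not equal 0x0F. That is the kind of widening problem the extra type parameter is there to prevent.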

The IAND(m, n, type) (“integer and”) macro simply performs a bitwise-AND of the bits in the first two operands, which are treated as type type, and returns the result. It supports any unsigned integer type for its operands. Kenneth Hamilton used the Fortran version of this macro in his article “Direct Memory Access from PC Fortrans” (Dr. Dobb's Journal, May 1993). Hamilton's code needed to extract the low byte from some integer value. Using the C macros in bitops.h, the equivalent would be:

unsigned int icl;

icl = IAND(ic, 255, unsigned int);

There is an integer extract bits (IBITS) macro that extracts and right-justifies one or more contiguous bits from a given value. IBITS is called as IBITS(value, bitnum, len, type). For instance, IBITS(0x5678, 8, 4, unsigned long) returns an unsigned-long value of 0x6.
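A bit-extraction macro with this calling sequence could be sketched as follows. The definition is an assumption made for illustration, not the bitops.h source, and it requires len to be smaller than the width of type, because a shift by the full width is undefined in C.

```c
/* Assumed definition, for illustration only: shift the wanted field
 * down to bit 0, then mask off everything above len bits.
 * Requires 0 <= start and len < width of type. */
#define IBITS_SKETCH(value, start, len, type)                    \
    ((type)(((type)(value) >> (start)) &                         \
            (type)~((type)~(type)0 << (len))))
```

For 0x5678 with a start bit of 8 and a length of 4, the shift leaves 0x56 and the mask keeps the low four bits, giving 0x6.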

Another Fortran-inspired macro is the integer shift (ISHFT) macro, a call to which appears as ISHFT(value, shifts, type). ISHFT and its circular-shift companion ISHFTC are unique in that they indicate the direction to shift by positive or negative values of the shifts parameter. A positive value for shifts causes a left shift by that many bits; a negative value causes a right shift by that many bits; a 0 value causes no shift. Make sure the absolute value of shifts is less than or equal to the size of type in bits; otherwise, the result is undefined.
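The signed-shift convention can be illustrated with a helper function wrapped by a macro. Both names are hypothetical and the code is not taken from bitops.h. In this sketch, any shift count at or beyond the width of unsigned long returns 0, a choice made here to sidestep undefined shifts.

```c
#include <limits.h>

/* Illustrative helper: positive shifts move bits left, negative shifts
 * move them right, zero leaves the value alone. */
static unsigned long ishft_ul(unsigned long value, int shifts)
{
    const unsigned width = (unsigned)(sizeof(unsigned long) * CHAR_BIT);
    unsigned count = (shifts < 0) ? 0u - (unsigned)shifts : (unsigned)shifts;

    if (count >= width)
        return 0UL;                 /* every bit shifted out */
    return (shifts < 0) ? (value >> count) : (value << count);
}

/* Assumed wrapper: convert to type, shift, and truncate back to type. */
#define ISHFT_SKETCH(value, shifts, type) \
    ((type)ishft_ul((unsigned long)(type)(value), (shifts)))
```

Routing the work through a function means the shifts argument is evaluated only once, which a pure macro with a conditional expression could not guarantee.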


For a Fortran version of the ishft routine, see Ray Duncan's “16-Bit Software Toolbox” column (Dr. Dobb's Journal, August 1985).

Unlike the other macros in bitops.h, MVBITS updates a value (using a pointer passed to it) rather than returning a value. The call MVBITS(src, srcindex, len, destptr, destindex, type) updates the value at *destptr (starting at bit destindex for len bits) with len bits extracted from src starting at bit number srcindex. Example 3 shows an example of using MVBITS.
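The update-in-place behavior can be sketched as a plain function on unsigned long. This is an illustration under stated assumptions: mvbits_ul is a hypothetical name, the type parameter of the real macro is dropped, and srcindex and destindex are assumed to be smaller than the width of unsigned long.

```c
#include <limits.h>

/*
 * Illustrative only: copy len bits of src, starting at bit srcindex,
 * into *destptr starting at bit destindex. All other bits of *destptr
 * are left unchanged. Bit 0 is the LSB.
 */
void mvbits_ul(unsigned long src, unsigned srcindex, unsigned len,
               unsigned long *destptr, unsigned destindex)
{
    const unsigned width = (unsigned)(sizeof(unsigned long) * CHAR_BIT);
    unsigned long mask = (len >= width) ? ~0UL : ((1UL << len) - 1UL);
    unsigned long field = (src >> srcindex) & mask;

    *destptr = (*destptr & ~(mask << destindex)) | (field << destindex);
}
```

Calling the function four times with 8-bit fields placed at bits 24, 16, 8, and 0 assembles a 32-bit word one field at a time. Those bit positions are an assumption for the sake of the example, not values taken from mmixcom.h.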

BitOps Examples Using MMIX 

In the examples from here on, I'll use the bitops.h C macros to handle instructions for MMIX. The parts of an MMIX instruction are multiples of 8 bits each, but the bitops.h macros don't depend on this.


A normal MMIX instruction is 32 bits long and broken into four fields of 8 bits each; see Table 1. Three fields generally refer to registers or contain immediate values. Knuth refers to these fields as X, Y, and Z. In some other instructions, Knuth combines the Y and Z fields for 16 bits. In still other instructions, he combines the X, Y, and Z fields into a 24-bit field.

Table 1: MMIX normal instruction layout.

unsigned short x;

x = ALL_ZERO_BITS(unsigned short);

Example 1: Simple C macro.

#include "mmixcom.h"    /* MMIX_Opcode_T, etc. */

MMIX_Instr_T Current_Instruction;
MMIX_Opcode_T Current_Opcode;

/* Assume Current_Instruction has already been set. */
/* IBITS right-justifies result, so use it to extract opcode. */
Current_Opcode = (MMIX_Opcode_T)
    IBITS(
        Current_Instruction,        /* value */
        MMIX_INSTR_OPCODE_START,    /* start bit num */
        MMIX_INSTR_OPCODE_LEN,      /* len */
        MMIX_Instr_T);              /* type */

Example 2: Extracting an opcode using the IBITS macro.

MMIX_Instr_T New_Instr =
    ALL_ZERO_BITS(MMIX_Instr_T);

/* Set opcode. */
MVBITS(
    0xC2, /* ADDU opcode */         /* src */
    0,                              /* src index: src bit 0. */
    MMIX_INSTR_OPCODE_LEN,          /* len */
    &New_Instr,                     /* dest ptr */
    MMIX_INSTR_OPCODE_START,        /* dest index */
    MMIX_Instr_T);                  /* type */

/* Set X (target) field to say r40. */
MVBITS(
    40, /* register 40 */           /* src */
    0,                              /* src index: src bit 0. */
    MMIX_INSTR_X_LEN,               /* len */
    &New_Instr,                     /* dest ptr */
    MMIX_INSTR_X_START,             /* dest index */
    MMIX_Instr_T);                  /* type */

/* Set Y (a source field) to r41. */
MVBITS(
    41, /* register 41 */           /* src */
    0,                              /* src index: src bit 0. */
    MMIX_INSTR_Y_LEN,               /* len */
    &New_Instr,                     /* dest ptr */
    MMIX_INSTR_Y_START,             /* dest index */
    MMIX_Instr_T);                  /* type */

/* Set Z (the other source field) to r42. */
MVBITS(
    42, /* register 42 */           /* src */
    0,                              /* src index: src bit 0. */
    MMIX_INSTR_Z_LEN,               /* len */
    &New_Instr,                     /* dest ptr */
    MMIX_INSTR_Z_START,             /* dest index */
    MMIX_Instr_T);                  /* type */

Example 3: Creating an instruction with the MVBITS macro.

#include "mmixcom.h"    /* MMIX_Word_T, MMIX_WORD_LEN, etc. */

MMIX_Word_T
Sim_SRU(    /* Simulate shift right unsigned instr. */
    MMIX_Word_T Source_Reg,
    MMIX_Word_T Shift_Count_Reg)
{
    if (Shift_Count_Reg >= MMIX_WORD_LEN)
        return (0);
    return (RIGHT_SHIFT_BITS(
        Source_Reg,                 /* value */
        Shift_Count_Reg,            /* shifts */
        MMIX_WORD_LEN,              /* len */
        MMIX_Word_T));              /* type */
}

Example 4: Using the RIGHT_SHIFT_BITS macro to simulate an SRU instruction.

MMIX_Word_T
Sim_XOR(    /* Simulate exclusive-OR bits instr. */
    MMIX_Word_T Some_Bits,
    MMIX_Word_T Other_Bits)
{
    return (IEOR(
        Some_Bits,
        Other_Bits,
        MMIX_Word_T));              /* type */
}

Example 5: Using the IEOR macro to simulate an XOR instruction.

For starters, I'll define a type to hold an instruction, keeping in mind that ANSI C implicitly requires unsigned long to hold 32 bits or more. Remember that ANSI C does not require any particular byte order when storing larger-than-byte objects in memory. Using a trailing _T convention to indicate a type, you can define a type (MMIX_Instr_T) to contain the object code for one instruction, as shown in mmixcom.h (Listing Two).

Taking advantage of the implicit ANSI C requirement that unsigned char be 8 bits or larger, mmixcom.h also defines types for bytes in general and instruction opcodes in particular. I call these types MMIX_Byte_T and MMIX_Opcode_T, respectively.

To use the bitops.h macros to extract the opcode from an instruction, for example, you need to specify start-bit numbers and bit lengths. Listing Two also contains equates called MMIX_INSTR_OPCODE_START and MMIX_INSTR_OPCODE_LEN for this.

Given those bit numbers, you can use the IBITS (integer extract bits) macro to extract the opcode from an instruction; see Example 2. You will recall that IBITS right-justifies its result.

MMIX also stores the X, Y, and Z fields as bytes. Example 3 shows how to create an instruction from scratch using MVBITS. In this case, I am creating an instruction to set r40 (register 40) to the unsigned sum of registers 41 and 42.

Shifting Bits with MMIX 
and bifops*h 

MMIX lias an SRU (shift right unsigned) 
instruction. SRU rj= r4»r5 is a shift-right 
unsigned iastniction in Knuth s current as¬ 
sembler syntax that sets register 3 to reg¬ 
ister 4 shifted right by the number of bits 
indicated in register 5. If the value in reg¬ 
ister 5 is greater than or equal to the size 
of a register in bits, then register 3 will lie 
set to zero. Example 4 shows a short rou¬ 
tine that simulates die SKI I instruction using 
the bitops.h RJGHT_S(-11F1_BITS macro. 

You can readily emulate MMIX’S XOR 


MMIX: Knuth’s New Computer 


B ack in the 1960s, Donald Knuth 
designed a hypothetical computer 
called “MIX* for his Art of Comput¬ 
er Program rn ing al gori thins books .MIX 
shows its age in various ways, so Knuth 
is designing a RISC-like successor called 
“MMIX” (short for “Meta-MDC or “Mega- 
MIX"). He started from scratch to avoid 
the restrictions of the old architecture. 
The new computer incorporates Big- 
endian byte ordering, byte addressing, 
two's- complement integer arithmetic and 
IFIRE floating-point arithmetic. 

Knuth hits not yet published his de¬ 
scription of MMIX. His latest draft is 
dated August 20, 1992. He expects to 
make many technical changes in his next 
draft, due sometime in 1995, so details 
given here may change as well 
Knuth plans to use MMIX for the later 
volumes of the Art of Compiitet*Program¬ 
ming series. I myself hope to write a 
cross assembler and simulator for MMIX 
for publication in Dr. Dobh'sJournal. 


This drives my exploration of “big- 
integer" (64-bits or more) routines for 
C T as well as 64-bit, portable object-file 
formats (like MLJFOM and ELF). 

in MMDy Knuth lias adopted the com¬ 
mon definition of a byte as having 8 bits, 
tie is much more generous with regis¬ 
ters; MMIX has 256 general-purpose 
registers. Knuth has also accounted for 
other practical issues this time. His de¬ 
scription of MMIX floating point ac¬ 
knowledges that on some models, the 
system may trap floating-point 'instruc¬ 
tions" and interpret them in software. 
MMIX is supposed to have virtual mem¬ 
ory, although the current draft doesn't 
seem to have enough detail for an op¬ 
erating system to deal with page faults 
or page tables. 

The 1992 draft defines a 32-bit sys¬ 
tem. but Knuth is likely to convert to 64 
bits before he publishes the final ver¬ 
sion, He has also had second thoughts 
about a number of complications the 


draft introduced: regions, probable 
branches, and delay slots, for example. 
Regions are a simple way to provide 
multiple address spaces, kind of like seg¬ 
ment registers. The delay slot avoids hav¬ 
ing to refill the prefetch buffer because 
of the branch, l first saw delay slots be¬ 
ing used on the MIPS when I worked at 
Microsoft. Many of us with previous 
assembly-language experience kept for¬ 
getting th at die ins true Lion in the delay 
slot would execute, too. 

Probable branches and delay slots are, in my opinion, architectural
warts to improve pipeline performance while driving the
assembly-language programmer crazy. In a pipelined system, the
instruction right after a branch has already been prefetched, so why
not execute it?

It seems, however, that Knuth is reconsidering these additions, and
they will probably be dropped from the next draft.

—JR- 
MMIX_Word_T
Sim_NOR(                        /* Simulate NOR bits instr. */
    MMIX_Word_T Some_Bits,
    MMIX_Word_T Other_Bits)
{
    return (NOR_BITS(
        Some_Bits,
        Other_Bits,
        MMIX_Word_T));          /* type */
}

Example 6: Using the NOR_BITS macro to simulate a NOR instruction.


instruction using the bitops.h IEOR (integer exclusive-OR) macro;
see Example 5.

MMIX is perhaps unique among instruction sets in having a nor bits
(NOR) instruction. You may be familiar with NOR and NAND gates from
digital electronics. The corresponding bitops.h macro is NOR_BITS.
Example 6 shows how the NOR_BITS macro may be used to simulate
MMIX's NOR instruction.

Conclusion 

C provides some powerful bitwise operators, although you need to be
careful regarding the widening of values between different data
types. The macros in bitops.h provide a more complete set of bitwise
operations than bare C, with protection against widening problems,
although you must avoid macro arguments with side effects. You should
also avoid using any of the signed data types with the macros.
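
The widening problem mentioned above can be seen in a few lines of C. The sketch below is only an illustration: the helper names naive_not and narrow_not are hypothetical and are not part of bitops.h. It assumes an 8-bit unsigned char and shows why the macros cast their results back to the caller's type.

```c
/* Hypothetical helpers for illustration; not part of bitops.h. */

/* Naive complement: the unsigned char operand is promoted to int
 * before ~ is applied, so the result also has its high-order bits
 * set (it is negative on typical platforms). */
int naive_not(unsigned char value)
{
    return ~value;
}

/* Complement cast back to the operand's own type, which discards
 * the extra bits introduced by the integer promotion. */
unsigned char narrow_not(unsigned char value)
{
    return (unsigned char) ~value;
}
```

With an 8-bit char, narrow_not(0x0F) yields 0xF0, while naive_not(0x0F) does not compare equal to 0xF0. The cast to "type" in macros such as NOT plays the same role as the cast in narrow_not.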
PowerPC Bi-Endian Capabilities


Jim Gillig 

The PowerPC is a Bi-endian RISC processor that supports both Big-
and Little-endian addressing models. The Bi-endian architecture
provides hardware and software developers with the flexibility to
choose either mode when migrating operating systems and applications
from their current BE or LE platforms to the PowerPC. Program
instructions are like multibyte-scalar data and are subject to the
byte-order effect of Endianness.

Each individual PowerPC machine instruction occupies an aligned word
in storage as a 32-bit integer containing that instruction's value.
In general, the appearance of instructions in memory is of no concern
to the programmer. Program code in memory is inherently either an LE
or a BE sequence of instructions, even if it is an Endian-neutral
implementation of an algorithm.

How does the PowerPC handle both LE and BE addressing models? The
processor calculates the effective address of data and instructions
in the same manner whether in BE mode or LE mode; when in LE mode
only, the PowerPC implementation further modifies the effective
address to provide the appearance of LE memory to the program for
loads and stores.

James R. is a software engineer on OS/2 and IBM Workplace
technologies in Boca Raton, FL. He can be reached through the DDJ
offices.

The operating system is responsible for establishing the Endian mode
in which processes execute. Once a mode is selected, all subsequent
memory loads and stores will be affected by the memory-addressing
model defined for that mode. Byte-alignment and performance issues
need to be understood before using an Endian mode for a given
application. Alignment interrupts may occur in LE mode for the
following load and store instructions:

* Fixed-point load instructions.

* Fixed-point store instructions.

* Load-and-store with byte-reversal instructions.

* Fixed-point load-and-store multiple instructions.

* Fixed-point move-assist instructions.

* Storage-synchronization instructions.

* Floating-point load instructions.

* Floating-point store instructions.

For multibyte-scalar operations, when executing in LE mode, the
current PowerPC processors take an alignment interrupt whenever a
load or store instruction is issued with a misaligned effective
address, regardless of whether such an access could be handled
without causing an interrupt in BE mode. For code that is compiled
to execute on the PowerPC in LE mode, the compiler should generate
as much aligned data and as many aligned instructions as possible to
minimize the alignment interrupts. Generally, more alignment
interrupts will occur in LE mode than in BE mode. When an alignment
interrupt occurs, the operating system should handle the interrupt
by software emulation of the load or store.

A very powerful feature of the PowerPC architecture is the set of
integer load-and-store instructions with byte reversal that allow
applications to interchange or convert data from one Endian type to
the other, without performance penalty. These load-and-store
instructions are lhbrx/sthbrx, load/store halfword byte-reverse
indexed, and lwbrx/stwbrx, load/store word byte-reverse indexed.
They are ideal for emulation programs that handle LE-type
instructions and data, such as the emulation of the Intel
instruction set and data. These instructions significantly improve
performance in loading and storing LE data while executing PowerPC
instructions in BE mode and emulating the Intel instruction
behavior; this eliminates the byte-alignment and data-conversion
overhead found in architectures that lack byte-reversal
instructions. Currently, these instructions can be accessed only
through assembly language. Until C compilers provide support to
automatically generate the right load and store instructions for
this type of data, C programs can rely on masking and concatenating
operations or embed the assembly-language byte-reversal
instructions.
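
As a rough illustration of the "masking and concatenating" alternative, the following sketch reverses the bytes of a halfword and a word in portable C. The function names swap_halfword and swap_word are hypothetical, and the code assumes 16-bit halfword and 32-bit word values held in unsigned short and unsigned long; it is not taken from any PowerPC toolchain.

```c
/* Hypothetical helpers for illustration only. */

/* Reverse the two bytes of a 16-bit halfword by masking each byte
 * and concatenating the pieces in the opposite order. */
unsigned short swap_halfword(unsigned short value)
{
    return (unsigned short) (((value & 0x00FFu) << 8) |
                             ((value & 0xFF00u) >> 8));
}

/* Reverse the four bytes of a 32-bit word the same way. */
unsigned long swap_word(unsigned long value)
{
    return ((value & 0x000000FFuL) << 24) |
           ((value & 0x0000FF00uL) <<  8) |
           ((value & 0x00FF0000uL) >>  8) |
           ((value & 0xFF000000uL) >> 24);
}
```

Each call costs several mask, shift, and OR operations, which is the overhead the byte-reversal load and store instructions are meant to remove.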
C MACROS
Listing One


/* BitOps.h - bit operation macros. Copyright (c) 1987-1994 by JR
 * (John Rogers). All rights reserved. CompuServe: 72634,2402
 * Permission is granted to use these macros in compiled code without payment
 * of royalties or inclusion of a copyright notice. This source file
 * may not be sold without written permission from the author.
 *
 * The following macros are inspired by the FORTRAN bit operation routines in
 * MIL-STD-1753. Except for MVBITS, all return values rather than modifying
 * parameters. MVBITS updates a parameter and does not return anything. The
 * leading "I" in most of these names means that they return some kind of
 * integer result.
 *     BTEST(value,bitnum,type)
 *     IAND(m,n,type)
 *     IBCLR(value,bitnum,type)
 *     IBITS(value,bitnum,len,type)
 *     IBSET(value,bitnum,type)
 *     IEOR(m,n,type)
 *     IOR(m,n,type)
 *     ISHFT(value,shifts,type)
 *     ISHFTC(value,shifts,len,type)
 *     MVBITS(src,srcindex,len,destptr,destindex,type)
 *     NOT(value,type)
 * The following C macros were invented by me (JR) or various other C
 * programmers; all return values rather than modifying parameters:
 *     ALL_ONE_BITS(type)
 *     ALL_ZERO_BITS(type)
 *     BIT_NUM_AND_LEN_TO_MASK(bitnum,len,type)
 *     BIT_NUM_TO_MASK(bitnum,type)
 *     CLEAR_BITS_USING_MASK(value,mask,type)
 *     FLIP_BITS_USING_MASK(value,mask,type)
 *     LEFT_CIRCULAR_SHIFT_BITS(value,shifts,len,type)
 *     LEFT_SHIFT_BITS(value,shifts,len,type)
 *     NAND_BITS(m,n,type)
 *     NOR_BITS(m,n,type)
 *     RIGHT_CIRCULAR_SHIFT_BITS(value,shifts,len,type)
 *     RIGHT_SHIFT_BITS(value,shifts,len,type)
 *     SET_BITS_USING_MASK(value,mask,type)
 *     TEST_BITS_USING_MASK(value,mask,type)
 *     TYPE_SIZE_IN_BITS(type)
 *     XNOR_BITS(m,n,type)
 * Beware of side effects: many macros in this file evaluate their arguments
 * more than once. These are marked with EVALTWICE comments.
 */

/* Gracefully allow multiple includes of this file. */
#ifndef BITOPS_H
#define BITOPS_H

/******************** INCLUDES ********************/

/*lint -efile(766,limits.h) */
#include <limits.h>     /* CHAR_BIT. */

/******************** MACROS ********************/

/* ALL_ONE_BITS(type): Generate a value of type "type" with all bits set to
 * 1. "type" must be an unsigned integral type.
 */
#define ALL_ONE_BITS(type)  ( (type) ~((type)0) )

/* ALL_ZERO_BITS(type): Generate a value of type "type" with all bits set to
 * 0. "type" must be an unsigned integral type.
 */
#define ALL_ZERO_BITS(type) ( (type) 0 )

/* BIT_NUM_AND_LEN_TO_MASK(bitnum,len,type): Return a mask of type "type",
 * with "len" bits on, starting at "bitnum". Bit 0 is LSB, bits start at
 * "bitnum" and are turned on in the mask starting at "bitnum" and going to
 * the left. "type" must be an unsigned integral type.
 */
/*EVALTWICE*/
#define BIT_NUM_AND_LEN_TO_MASK(bitnum,len,type) \
    /*CONSTCOND*/ \
    /*lint -save -e506 -e572 -e778 */ \
    ( \
      ( (ALL_ONE_BITS(type)) \
        >> ((TYPE_SIZE_IN_BITS(type)) \
            - ((bitnum)+(type)(len)) ) ) \
      & ( \
          ((bitnum)>0) \
          ? ~( ALL_ONE_BITS(type) \
               >> ( (TYPE_SIZE_IN_BITS(type)) \
                    - (bitnum) ) ) \
          : ALL_ONE_BITS(type) ) \
    ) \
    /*lint -restore */

/* BIT_NUM_TO_MASK(bitnum,type): Convert bit number "bitnum" to mask of type
 * "type". Bits are numbered from right to left, with bit 0 being the least
 * significant bit (LSB). "type" must be an unsigned integral type.
 * This is my (JR's) modification of something posted to Usenet by Bill
 * Shannon (shannon@sun.uucp) many years ago.
 */
#define BIT_NUM_TO_MASK(bitnum,type) \
    ( (type) ( ((type)1) << ((type)(bitnum)) ) )

/* BTEST(value,bitnum,type): Test bit numbered "bitnum" in "value", which must
 * be of type "type". If the tested bit is on, return a Boolean true value
 * (1); otherwise, return a Boolean false (0). "type" must be an unsigned
 * integral type.
 */
#define BTEST(value,bitnum,type) \
    ( ( (value) & (BIT_NUM_TO_MASK((bitnum),type)) ) \
      ? 1 : 0 \
    )

/* CLEAR_BITS_USING_MASK(value,mask,type): Return "value", except that any bits
 * which are turned on in "mask" will be turned off in the return value.
 * "type" must be an unsigned integral type.
 */
#define CLEAR_BITS_USING_MASK(value,mask,type) \
    ( (type) ( (value) & ~(mask) ) )

/* FLIP_BITS_USING_MASK(value,mask,type): Return "value", except that any bits
 * which are turned on in "mask" will be flipped (toggled) in the return
 * value. "type" must be an unsigned integral type.
 */
#define FLIP_BITS_USING_MASK(value,mask,type) \
    ( (type) ( (value) ^ (mask) ) )

/* IAND(m,n,type): Return the bitwise "and" of the integral values "m" and "n".
 * "type" must be an unsigned integral type.
 */
#define IAND(m,n,type) \
    ( (type) ( (m) & (n) ) )

/* IBCLR(value,bitnum,type): Return "value" with bit at "bitnum" cleared
 * (zeroed). "type" must be an unsigned integral type.
 */
#define IBCLR(value,bitnum,type) \
    ( \
      (type) CLEAR_BITS_USING_MASK( \
        (value), \
        BIT_NUM_TO_MASK( (bitnum), type ), \
        type) \
    )

/* IBITS(value,bitnum,len,type): Extract bits from "value", starting at bit
 * "bitnum", for "len" bits. The result will be right justified. "type" must be
 * an unsigned integral type.
 */
/*EVALTWICE*/
#define IBITS(value,bitnum,len,type) \
    /*CONSTCOND*/ \
    /*lint -save */  /* Preserve PC-LINT options. */ \
    /*lint -e572 */  /* Ignore excessive shift val */ \
    /*lint -e778 */  /* Ignore const expr eval to 0 */ \
    ( (type) \
      ( ( (value) & \
          (BIT_NUM_AND_LEN_TO_MASK( \
             (bitnum), (len), type )) ) \
        >> (bitnum) ) \
    ) \
    /*lint -restore */

/* IBSET(value,bitnum,type): Return "value" with bit at "bitnum" set to true.
 * "type" must be an unsigned integral type.
 */
#define IBSET(value,bitnum,type) \
    ( \
      SET_BITS_USING_MASK( \
        (value), \
        BIT_NUM_TO_MASK( (bitnum), type ), \
        type) \
    )

/* IEOR(m,n,type): Return the bitwise exclusive-or of the integral values "m"
 * and "n". "type" must be an unsigned integral type.
 */
#define IEOR(m,n,type) \
    ( (type) ( (m) ^ (n) ) )

/* IOR(m,n,type): Return the bitwise "or" of the integral values "m" and "n".
 * "type" must be an unsigned integral type.
 */
#define IOR(m,n,type) \
    ( (type) ( (m) | (n) ) )

/* ISHFT(value,shifts,type): Return "value" with bits logically shifted as
 * specified by "shifts". Zeros will be shifted in as applicable. A positive
 * amount for "shifts" causes a left shift; a negative amount causes a right
 * shift; a zero amount causes no shift. Note that the absolute value of
 * "shifts" must be less than or equal to TYPE_SIZE_IN_BITS("type"). Also note
 * that "value" must be of type "type", and "type" must be an unsigned
 * integral type.
 */
/*EVALTWICE*/
#define ISHFT(value,shifts,type) \
    /*CONSTCOND*/ \
    /*lint -save */  /* Preserve PC-LINT settings. */ \
    /*lint -e504 */  /* Ignore unusual shift value */ \
    /*lint -e778 */  /* Ignore const expr eval to 0 */ \
    ( (type) \
      ( ((shifts)>0) \
        ? ( (value) << (shifts) ) \
        : ( ( (shifts)<0 ) \
            ? ( (value) >> (-(shifts)) ) \
            : (value) \
          ) \
      ) \
    ) \
    /*lint -restore */

/* ISHFTC(value,shifts,len,type): Return "value" with bits circularly shifted
 * (as specified by "shifts") within the lower "len" bits of "value". A
 * positive amount for "shifts" causes a left shift; a negative amount causes
 * a right shift; a zero amount causes no shift. Note that the absolute value
 * of "shifts" must be less than or equal to "len". Also note that "value"
 * must be of type "type", and "type" must be an unsigned integral type. "len"
 * must be greater than 0 and less than or equal to TYPE_SIZE_IN_BITS("type").
 */
/*EVALTWICE*/
#define ISHFTC(value,shifts,len,type) \
    /*lint -save -e504 */ \
    ( (type) ( \
        ( ((shifts) == 0) \
          || ((len) == (type) (shifts)) \
          || ((len) == - (type) (shifts)) ) \
        ? ((type)(value)) \
        : ( \
            ( (shifts) > 0 ) \
            ? (RIGHT_CIRCULAR_SHIFT_BITS( \
                 (value), (shifts), (len), type) ) \
            : (LEFT_CIRCULAR_SHIFT_BITS( \
                 (value), (type) (- (shifts)), \
                 (len), type) ) ) \
      ) \
    ) /*lint -restore */

/* LEFT_CIRCULAR_SHIFT_BITS(value,shifts,len,type): Return "value" with bits
 * circularly shifted left "shifts" bits within the lower "len" bits of
 * "value". A zero amount for "shifts" causes no shift. Note that "shifts"
 * must be less than or equal to "len". Also note that "value" must be of type
 * "type", and "type" must be an unsigned integral type. "len" must also be
 * greater than zero and less than or equal to TYPE_SIZE_IN_BITS("type").
 */
/*EVALTWICE*/
#define LEFT_CIRCULAR_SHIFT_BITS( \
            value,shifts,len,type) \
    /*lint -save -e504 */ \
    ( (type) ( \
        ((shifts)==0) || ((len)==(type) (shifts)) \
        ? (value) \
        : ( ( (value) & \
              ~BIT_NUM_AND_LEN_TO_MASK(0, (len), type) ) \
            | ( ( (value) & (BIT_NUM_AND_LEN_TO_MASK( \
                    0, (len), type )) ) \
                >> (shifts) ) \
            | ( ( (value) & (BIT_NUM_AND_LEN_TO_MASK( \
                    0, (shifts), type )) ) \
                << ((len)-(type)(shifts)) ) ) ) \
    ) /*lint -restore */

/* LEFT_SHIFT_BITS(value,shifts,len,type): Return "value" with bits logically
 * shifted left "shifts" bits within the lower "len" bits of "value". If
 * necessary, zero bits are added on the right. A zero amount for "shifts"
 * causes no shift. Note that "shifts" must be less than or equal to "len".
 * Also note that "value" must be of type "type", and "type" must be an
 * unsigned integral type. "len" must also be greater than zero and less than
 * or equal to TYPE_SIZE_IN_BITS("type").
 */
/*EVALTWICE*/
#define LEFT_SHIFT_BITS(value,shifts,len,type) \
    /*lint -save -e504 */ \
    ( (type) ( \
        ( ((shifts)==0) || ((len)==(type) (shifts)) ) \
        ? (value) \
        : ( ( (value) & \
              ~BIT_NUM_AND_LEN_TO_MASK(0, (len), type) ) \
            | ( ( (value) << (shifts) ) \
                & (BIT_NUM_AND_LEN_TO_MASK( \
                     0, (len), type )) ) ) \
      ) \
    ) /*lint -restore */

/* MVBITS(src,srcindex,len,destptr,destindex,type): Update the value that
 * "destptr" points to, using bits extracted from "src" starting at bit
 * "srcindex" for "len" bits. "destindex" indicates the bit number in the
 * destination to begin updates. "type" must be an unsigned integral type.
 */
/*EVALTWICE*/
#define MVBITS( \
            src,srcindex,len,destptr,destindex,type) \
    /*CONSTCOND*/ /*lint -save -e506 */ \
    { \
        type srcbits = \
            (src) & BIT_NUM_AND_LEN_TO_MASK( \
                (srcindex), (len), type ); \
        type destmask = BIT_NUM_AND_LEN_TO_MASK( \
                (destindex), (len), type ); \
        *(destptr) &= ~destmask; \
        *(destptr) |= ISHFT( \
            srcbits, \
            (int) ((destindex)-(srcindex)), \
            type ); \
    } /*lint -restore */

/* NAND_BITS(m,n,type): Return the bitwise "nand" of the integral values
 * "m" and "n". "type" must be an unsigned integral type.
 */
#define NAND_BITS(m,n,type) \
    ( (type) ~ ( IAND((m),(n),type) ) )

/* NOR_BITS(m,n,type): Return the bitwise "nor" of the integral values "m" and
 * "n". "type" must be an unsigned integral type.
 */
#define NOR_BITS(m,n,type) \
    ( (type) ~ ( IOR((m),(n),type) ) )

/* NOT(value,type): Return all bits of "value" flipped. Note that "value" must
 * be of type "type", which must be an unsigned integral type.
 */
#define NOT(value,type) ( (type) ~((type)(value)) )

/* RIGHT_CIRCULAR_SHIFT_BITS(value,shifts,len,type): Return "value" with bits
 * circularly shifted right "shifts" bits within the lower "len" bits of
 * "value". A zero amount for "shifts" causes no shift. Note that "shifts"
 * must be less than or equal to "len". Also note that "value" must be of type
 * "type", and "type" must be an unsigned integral type. "len" must also be
 * greater than zero and less than or equal to TYPE_SIZE_IN_BITS("type").
 */
/*EVALTWICE*/
#define RIGHT_CIRCULAR_SHIFT_BITS( \
            value,shifts,len,type) \
    /*lint -save -e504 */ \
    ( (type) ( \
        ((shifts)==0) || ((len)==(type) (shifts)) \
        ? (value) \
        : ( ( (value) & \
              ~BIT_NUM_AND_LEN_TO_MASK(0,(len),type) ) \
            | ( ( (value) & (BIT_NUM_AND_LEN_TO_MASK( \
                    0, ((len)-(type)(shifts)),type)) ) \
                << (shifts)) \
            | ( ((value) & (BIT_NUM_AND_LEN_TO_MASK( \
                    ((len)-(type)(shifts)),(shifts),type))) \
                >> ((len)-(type)(shifts)) ) ) ) \
    ) /*lint -restore */

/* RIGHT_SHIFT_BITS(value,shifts,len,type): Return "value" with bits logically
 * shifted right "shifts" bits within the lower "len" bits of "value". If
 * necessary, zero bits are added on the left. A zero amount for "shifts"
 * causes no shift. Note that "shifts" must be less than or equal to "len".
 * Also note that "value" must be of type "type", and "type" must be an
 * unsigned integral type. "len" must also be greater than zero and
 * less than or equal to TYPE_SIZE_IN_BITS("type").
 */
/*EVALTWICE*/
#define RIGHT_SHIFT_BITS(value,shifts,len,type) \
    /*lint -save -e504 */ \
    ( (type) ( \
        ( ((shifts)==0) || ((len)==(type) (shifts)) ) \
        ? (value) \
        : ( ( (value) & \
              ~BIT_NUM_AND_LEN_TO_MASK(0,(len),type) ) \
            | ( ( (value) & (BIT_NUM_AND_LEN_TO_MASK( \
                    0, (len), type )) ) >> (shifts) ) ) \
      ) \
    ) /*lint -restore */

/* SET_BITS_USING_MASK(value,mask,type): Return "value", except that any bits
 * which are turned on in "mask" will also be turned on in the return
 * value. "type" must be an unsigned integral type.
 */
#define SET_BITS_USING_MASK(value,mask,type) \
    ( (type) ( (value) | (mask) ) )

/* TEST_BITS_USING_MASK(value,mask,type): Return "value", except that only bits
 * which are turned on in "mask" will be returned. "type" must be an
 * unsigned integral type.
 */
#define TEST_BITS_USING_MASK(value,mask,type) \
    ( (type) ( (value) & (mask) ) )

/* TYPE_SIZE_IN_BITS(type): Return the number of bits required for type "type".
 */
#define TYPE_SIZE_IN_BITS(type) \
    ( (type) ( sizeof(type) * CHAR_BIT ) )

/* XNOR_BITS(m,n,type): Return the bitwise exclusive "nor" of the integral
 * values "m" and "n". "type" must be an unsigned integral type.
 */
#define XNOR_BITS(m,n,type) \
    ( (type) ~ ( IEOR((m),(n),type) ) )

#endif /* BITOPS_H */


Listing Two 

/* mmixcom.h - MMIX common defns. Copyright (c) 1994 by JR (John Rogers).
 * All rights reserved. CompuServe: 72634,2402
 * FUNCTION - mmixcom.h contains types and equates used for defining MMIX
 * instructions in object code format.
 * We take advantage of the implicit ANSI C requirement that unsigned char be 8
 * bits or larger. Similarly, we can assume unsigned long is 32 bits or larger.
 */

#ifndef MMIXCOM_H
#define MMIXCOM_H

/* Define a type for one instruction. Note that this will be at least 32 bits,
 * depending on the compiler.
 */
typedef unsigned long MMIX_Instr_T;

/* We also need to deal with single words in MMIX. These are currently 32 bits
 * wide, although Knuth is likely to change them to 64 bits soon.
 */
typedef unsigned long MMIX_Word_T;
#define MMIX_WORD_LEN 32

/* Many parts of MMIX words are in bytes. In MMIX, a byte is 8 bits long. In C,
 * this might be larger.
 */
typedef unsigned char MMIX_Byte_T;

/* Even if "char" is more than 8 bits, leave this. */
#define MMIX_BYTE_BIT_LEN 8

/* Define a type for an opcode. */
typedef MMIX_Byte_T MMIX_Opcode_T;

/* Define equates for each part of MMIX_Instr_T. Use bit numbering convention
 * of 0=least significant bit (LSB).
 */
#define MMIX_INSTR_OPCODE_START 24
#define MMIX_INSTR_OPCODE_LEN   MMIX_BYTE_BIT_LEN

#define MMIX_INSTR_X_START      16
#define MMIX_INSTR_X_LEN        MMIX_BYTE_BIT_LEN

#define MMIX_INSTR_Y_START      8
#define MMIX_INSTR_Y_LEN        MMIX_BYTE_BIT_LEN

#define MMIX_INSTR_Z_START      0
#define MMIX_INSTR_Z_LEN        MMIX_BYTE_BIT_LEN

#endif /* MMIXCOM_H */


End Listings 
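
The field layout in Listing Two (opcode at bit 24, X at 16, Y at 8, Z at 0, each 8 bits long) can be exercised with a small self-contained sketch. The helper names get_field and make_instr and the shortened constant names are hypothetical; the code uses plain shifts and masks rather than the bitops.h macros so that it compiles on its own.

```c
typedef unsigned long MMIX_Instr_T;  /* at least 32 bits, as in Listing Two */

#define FIELD_LEN    8     /* each field is one 8-bit MMIX byte */
#define OPCODE_START 24
#define X_START      16
#define Y_START      8
#define Z_START      0

/* Hypothetical helper: extract the right-justified FIELD_LEN-bit field
 * that begins at bit "start" (bit 0 is the LSB). */
unsigned int get_field(MMIX_Instr_T instr, unsigned int start)
{
    return (unsigned int) ((instr >> start) & ((1uL << FIELD_LEN) - 1uL));
}

/* Hypothetical helper: pack four 8-bit fields into one instruction. */
MMIX_Instr_T make_instr(unsigned int opcode, unsigned int x,
                        unsigned int y, unsigned int z)
{
    return ((MMIX_Instr_T) (opcode & 0xFFu) << OPCODE_START) |
           ((MMIX_Instr_T) (x & 0xFFu) << X_START) |
           ((MMIX_Instr_T) (y & 0xFFu) << Y_START) |
           ((MMIX_Instr_T) (z & 0xFFu) << Z_START);
}
```

With the bitops.h macros, the same extraction would presumably be written as IBITS(instr, MMIX_INSTR_X_START, MMIX_INSTR_X_LEN, MMIX_Instr_T).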
RAMBLINGS IN REAL TIME





Frames of Reference 


Several years ago, I opened a column in Dr. Dobb’s Journal with a
story about singing my daughter to sleep with Beatles songs. Beatles
songs, at least the earlier ones, tend to be bouncy and pleasant,
which makes them suitable good-night fodder, and there are a lot of
them, a useful hedge against terminal boredom. So for many good
reasons, “Can’t Buy Me Love” and “Hard Day's Night” and “Help!” and
the rest were evening staples for years.

No longer, though. You see, I got my wife some Beatles tapes for
Christmas. We've all been listening to them in the car, and now that
my daughter has heard the real thing, she can barely stand to be in
the same room, much less fall asleep, when I sing those songs.

What's noteworthy is that the only variable involved in this change
was my daughter's frame of reference. My singing hasn’t gotten any
worse over the last four years. (I'm not sure it's possible for my
singing to get worse.) All that changed was my daughter's frame of
reference for those songs. The rest of the universe stayed the same;
the change was in her mind: lock, stock, and barrel.

Often, the key to solving a problem or working on a problem
efficiently is a proper frame of reference. Your model of the
problem often determines how deeply you can understand it, and how
flexible and innovative you can be in solving it.

Michael Abrash

Michael is the author of Zen of Graphics Programming and Zen of Code
Optimization. He is currently pushing the envelope of real-time 3-D
on Quake at id Software. He can be reached at mikeab@idsoftware.com.

An excellent example of this, and one which I'll discuss toward the
end of this column, is that of 3-D transforms: the process of
converting coordinates from one coordinate space to another, for
example from worldspace to viewspace. The way this is traditionally
explained is functional, but not particularly intuitive, and fairly
hard to visualize. Recently, I’ve come across another way of looking
at transforms that seems far easier to grasp. The two approaches are
technically equivalent, so the difference is purely a matter of how
we view things, but sometimes that’s the most important difference.

Before we can talk about transforming 
between coordinate spaces, however, we 
need two building blocks: dot products 
and cross products. 

3-D Math 

In my last column I promised to present a BSP-based renderer this
month, to complement the BSP compiler we’ve developed over the last
two columns. But the considerable amount of mail about 3-D math that
I’ve received over the last two months changed my mind. In every
case, the writer bemoaned his or her lack of expertise with 3-D
math, asked me to recommend books about 3-D math, and questioned how
they could learn more.

That’s a commendable attitude, but the truth is, there’s not all
that much to 3-D math, at least for the sort of polygon-based,
real-time 3-D done on PCs. You really need only two basic math tools
beyond simple arithmetic: dot products and cross products; mostly,
just the former. My friend Chris Hecker points out that this is an
oversimplification; math-related operations like BSP trees, graphs,
discrete math for edge stepping, and affine and perspective texture
mappings also go into a production-quality game. While that’s true,
dot and cross products, together with matrix math and perspective
projection, constitute the bulk of what most people mean by “3-D
math.” As we'll see, dot and cross products are key tools for a lot
of useful 3-D operations.

The mail also made clear that a lot of people out there don't
understand dot or cross products, at least insofar as they apply to
3-D. Since just about everything I'll do in this column relies to
some extent on dot and cross products (even the line-intersection
formula I discussed last time is actually a quotient of dot
products), I'll devote this column to examining these basic tools
and some of their 3-D applications. If this is old hat to you, my
apologies; I'll return to BSP-based rendering next time.

A Little Background 

Dot and cross products themselves are straightforward and require
almost no context to understand, but I need to define some terms
I'll use when describing their application.

I assume you have some math background, so I'll quickly define a
“vector” as a direction and a magnitude, represented as a coordinate
pair (in 2-D) or triplet (in 3-D), relative to the origin. That’s a
pretty sloppy definition, but it’ll do for our purposes; for the
real McCoy, check out Calculus and Analytic Geometry, Eighth
Edition, by George B. Thomas, Jr. and Ross L. Finney
(Addison-Wesley, 1991, ISBN 0-201-52929-7).

So, for example, in 3-D, the vector V=[5 0 5] has a length, or
magnitude, of $\|V\| = 5\sqrt{2}$, by the Pythagorean theorem, as
shown in Example 1 (vertical double bars denote vector length), and
a direction in the plane of the x and z axes, exactly halfway
between those two axes.

I'll be working in a left-handed coordinate system, whereby if you
wrap the fingers of your left hand around the z axis with your thumb
pointing in the positive z direction, the fingers will curl from the
positive x axis to the positive y axis. The positive x axis runs
left to right across the screen, the positive y axis runs bottom to
top, and the positive z axis runs into the screen.

For our purposes, projection is the process of mapping coordinates
onto a line or surface. “Perspective projection” projects 3-D
coordinates onto a viewplane, scaling coordinates according to their
z distance from the viewpoint in order to provide proper
perspective. “Objectspace” is the coordinate space in which an
object is defined, independent of other objects and the world
itself. “Worldspace” is the absolute frame of reference for a 3-D
world; all objects’ locations and orientations are with respect to
worldspace, and this is the frame of reference around which the
viewpoint and view direction move. “Viewspace” is worldspace as seen
from the viewpoint, looking in the view direction. “Screenspace” is
viewspace after perspective projection and scaling to the screen.

$\|V\| = \sqrt{v_1^2 + v_2^2 + v_3^2} = \sqrt{5^2 + 0^2 + 5^2} = 5\sqrt{2}$

Example 1: The Pythagorean theorem in 3-D, where the vector
V=[5 0 5] has a length, or magnitude, of $5\sqrt{2}$.

Finally, “transformation” is the process of converting points from
one coordinate space into another; in our case, that’ll mean
rotating and translating (moving) points from objectspace or
worldspace to viewspace.

For additional information, check out Computer Graphics: Principles
and Practice, Second Edition, by James D. Foley and Andries van Dam
(Addison-Wesley, 1990, ISBN 0-201-12110-7), or my X-Sharp columns in
DDJ in 1992; those columns are also collected in my book Zen of
Graphics Programming (Coriolis Group Books, 1995,
ISBN 1-883577-08-X).

The Dot Product 

Now for the dot product. Given two vectors $U = [u_1\ u_2\ u_3]$ and
$V = [v_1\ v_2\ v_3]$, their dot product (denoted by the $\cdot$
symbol) is calculated as in Example 2(a). The result is a scalar
value (a single, real-valued number), not another vector.

Now that you know how to calculate a dot product, what does that get
you? Not much. The dot product isn’t much use for graphics until you
start thinking of it as in Example 2(b), where $\theta$ is the angle
between the two vectors and the other two terms are the lengths of
the vectors, as shown in Figure 1. Although it’s not immediately
obvious, Example 2(b) has a wide variety of applications in 3-D
graphics.


(a) $U \cdot V = u_1 v_1 + u_2 v_2 + u_3 v_3$

(b) $U \cdot V = \cos(\theta)\,\|U\|\,\|V\|$


Example 2: (a) Calculating a dot product; (b) using dot products for
3-D graphics.
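
Example 2(a) translates almost directly into C. In the sketch below, the Vec3 type and the dot_product name are hypothetical choices made for illustration, and the components are stored as doubles.

```c
/* Hypothetical 3-D vector type; v[0], v[1], v[2] hold x, y, z. */
typedef struct { double v[3]; } Vec3;

/* Dot product per Example 2(a): multiply matching components and
 * sum the results. The result is a scalar, not a vector. */
double dot_product(const Vec3 *a, const Vec3 *b)
{
    return a->v[0] * b->v[0] +
           a->v[1] * b->v[1] +
           a->v[2] * b->v[2];
}
```

Two perpendicular vectors, such as the unit vectors along the x and y axes, give a dot product of zero, which agrees with Example 2(b) because the cosine of a right angle is zero.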



Dot Products of Unit Vectors 

The simplest case of the dot product is when both vectors are unit
vectors; that is, when their lengths are both one, as calculated in
Example 1. In this case, Example 2(b) simplifies to Example 3(a). In
other words, the dot product of two unit vectors is the cosine of
the angle between them.

One obvious use of this is to find angles between unit vectors, in
conjunction with an inverse cosine function or lookup table. A more
useful application for 3-D graphics is in lighting surfaces, where
the cosine of the angle between incident light and the normal
(perpendicular vector) of a surface determines the fraction of the
light's full intensity at which the surface is illuminated, as in
Example 3(b), where $I_s$ is the intensity of illumination of the
surface, $I_l$ is the intensity of the light, and $\theta$ is the
angle between $-D_l$ (where $D_l$ is the light direction vector) and
the surface normal. If the inverse light vector and the surface
normal are both unit vectors, then this calculation can be performed
with four multiplies and two additions, and no explicit cosine
calculations, as in Example 3(c), where $N_s$ is the surface unit
normal and $D_l$ is the light unit direction vector; see Figure 2.

(a) $U \cdot V = \cos(\theta)$

(b) $I_s = I_l \cos(\theta)$

(c) $I_s = I_l\,(N_s \cdot -D_l)$

Example 3: (a) Example 2(b) with unit vectors; (b) using dot
products for lighting surfaces; (c) performing the calculation with
four multiplies and two additions, and no explicit cosine
calculations.

Figure 1: The dot product; $U \cdot V = \cos(\theta)\,\|U\|\,\|V\|$.

Figure 2: Lighting intensity is a function of
$\cos(\theta) = N_s \cdot -D_l$.

Figure 3: Viewspace normal z direction doesn't necessarily indicate
front/back visibility after perspective projection.
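
The lighting calculation of Example 3(c) might be sketched in C as shown below. The Vec3 type and the surface_intensity name are hypothetical, both input vectors are assumed to be unit length already, and clamping negative results to zero (for surfaces facing away from the light) is an added assumption that the text does not discuss.

```c
/* Hypothetical 3-D vector type; v[0], v[1], v[2] hold x, y, z. */
typedef struct { double v[3]; } Vec3;

/* Illumination of a surface per Example 3(c). surface_normal and
 * light_dir are assumed to be unit vectors. */
double surface_intensity(double light_intensity,
                         const Vec3 *surface_normal,
                         const Vec3 *light_dir)
{
    /* N_s . (-D_l), computed as -(N_s . D_l):
     * three multiplies and two additions. */
    double cos_theta = -(surface_normal->v[0] * light_dir->v[0] +
                         surface_normal->v[1] * light_dir->v[1] +
                         surface_normal->v[2] * light_dir->v[2]);

    /* Assumption: a surface facing away from the light is unlit. */
    if (cos_theta < 0.0)
        cos_theta = 0.0;

    return light_intensity * cos_theta;   /* the fourth multiply */
}
```

A surface whose normal points straight back along the light direction receives the light's full intensity; as the angle between the two grows, the intensity falls off with the cosine.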

A Brief Aside on Cross Products 

One question Example 3(c) begs is, Where
does the surface unit normal come from?
One approach is to store the end of a sur¬ 
face normal as an extra data point with 
each polygon (with the start being some 
point that’s already in the polygon), and 
transform it along with the rest of the 
points. This has the advantage that if the 
normal starts out as a unit normal, it will 
end up that way too, if only rotations and 
translations (but not scaling and shears) 
are performed. 

The problem with an explicit normal is
that it will remain a normal (that is, perpendicular to the surface) only through
viewspace. Rotation, translation, and uniform scaling preserve right angles, which is why
normals are still normals in viewspace,
but perspective projection does not preserve angles, so vectors that were surface


$U \times V = [\,u_2 v_3 - u_3 v_2,\; u_3 v_1 - u_1 v_3,\; u_1 v_2 - u_2 v_1\,]$


Example 4: Formula for the cross
product.

(a) $U \cdot V = \cos(\theta) \|U\|$

(b) $\text{distance} = |(P - O_p) \cdot N_p|$

Example 5: (a) Using the dot product
for projection; (b) using the dot
product to determine the distance to a
plane.



Figure 4: The cross product of two
polygon edge vectors generates a
polygon normal; $\text{normal} = E_0 \times E_1$.




normals in viewspace are no longer normals in screenspace.

Why does this matter? Because, on
average, half the polygons in any scene
face away from the viewer, and hence
shouldn't be drawn. One way to identify such polygons is to see whether they
face toward or away from the viewer;
that is, whether their normals have negative z values (so they're visible) or positive z values (so they should be culled).
However, we're talking about screenspace
normals here, because the perspective
projection can shift a polygon relative to
the viewpoint so that although its viewspace normal has a negative z, its
screenspace normal has a positive z, and
vice versa, as in Figure 3. So we need
screenspace normals, but those can't
readily be generated by transformation
from worldspace.

The solution is to use the cross product of two of the polygon's edges to generate a normal. Example 4 is the formula for the cross product. (Note that the
cross-product operation is denoted by an
×.) Unlike the dot product, the result of



Figure 5: Backface culling with the
dot product. $V_0 \cdot N_0 < 0$, so polygon 0
faces forward and is visible; $V_1 \cdot N_1 > 0$,
so polygon 1 faces backward and is
invisible.


the cross product is a vector. Not just any
vector, either: the vector generated by
the cross product is perpendicular to both
of the original vectors. Thus, the cross
product can be used to generate a normal to any surface for which you have
two vectors that lie within the surface.
This means that we can generate the
screenspace normals we need by taking
the cross product of two adjacent polygon edges, as in Figure 4. In fact, we can
cull with only one-third the work needed to generate a full cross product; because we're interested only in the sign of
the z component of the normal, we can
skip calculating the x and y components
entirely. The only caveat is to be careful
that neither edge you choose is zero-length and that the edges aren't collinear,
because the cross product can't produce a
normal in those cases.
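
As a sketch of that one-third-the-work test, the C below computes only the z component of the cross product of two adjacent screenspace edges. The function names are hypothetical, and whether negative z really means "visible" in your renderer depends on its handedness and vertex winding; the code simply follows the sign convention described above.

```c
/* z component of E0 x E1, where E0 = v1 - v0 and E1 = v2 - v1 are two
   adjacent polygon edges. Only the screenspace x and y of each vertex
   are read, so each vertex needs at least two floats. */
static float cross_z(const float *v0, const float *v1, const float *v2)
{
    float e0x = v1[0] - v0[0], e0y = v1[1] - v0[1];
    float e1x = v2[0] - v1[0], e1y = v2[1] - v1[1];

    return e0x * e1y - e0y * e1x;   /* u1*v2 - u2*v1 from Example 4 */
}

/* Nonzero if the polygon should be drawn. A result of exactly 0 means
   the edges were zero-length or collinear, or the polygon is edge-on;
   it is treated as not visible here. */
static int is_visible(const float *v0, const float *v1, const float *v2)
{
    return cross_z(v0, v1, v2) < 0.0f;
}
```

Two multiplies and a subtraction per polygon (plus the edge differences) is all the test costs.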

Perhaps the most often asked question about cross products is, Which way
do normals generated by cross products
go? In a left-handed coordinate system,
curl the fingers of your left hand so the
fingers curl through an angle of less than
180 degrees from the first vector in the
cross product to the second vector. Your
thumb now points in the direction of the
normal.

Figure 6: The dot product with a unit
vector performs a projection.


If you take the cross product of two
orthogonal (right-angle) unit vectors, the
result will be a unit vector that's orthogonal to both of them. This means that if
you're generating a new coordinate
space (such as a new viewing frame of
reference), you only need to come up
with unit vectors for two of the axes for
the new coordinate space. You can then
use their cross product to generate the
unit vector for the third axis. If you need
unit normals and the two vectors being
crossed aren't orthogonal unit vectors,
you'll have to normalize the resulting vector; that is, divide each of the vector's
components by the length of the vector,
to make it a unit long.
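
A short C sketch of the third-axis trick follows, using the formula from Example 4. The name `cross3` is an assumption of mine. Because the inputs in the usage below are orthogonal unit vectors, the result is already unit length and no normalization step is shown.

```c
/* out = u x v, per Example 4 (array indices are zero-based, so
   u[0], u[1], u[2] correspond to u1, u2, u3 in the formula).
   out must not point at the same storage as u or v. */
static void cross3(const float *u, const float *v, float *out)
{
    out[0] = u[1] * v[2] - u[2] * v[1];
    out[1] = u[2] * v[0] - u[0] * v[2];
    out[2] = u[0] * v[1] - u[1] * v[0];
}
```

Given unit vectors for two axes of a new frame, say (1, 0, 0) and (0, 1, 0), `cross3` fills in (0, 0, 1) for the third. If the inputs were not orthogonal unit vectors, you would still have to divide the result by its length afterward.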

Using the Sign of the Dot Product 

The dot product is the cosine of the an¬ 
gle between two vectors, scaled by the 
magnitudes of the vectors. Magnitudes are 
always positive, so the sign of the cosine 
determines the sign of the result. The dot 
product is positive if the angle between 
the vectors is less than 90 degrees, nega¬ 
tive if it’s greater than 90 degrees, and 0 
if the angle is exactly 90 degrees. This 
means that just the sign of the dot prod¬ 
uct suffices for tests involving compar¬ 
isons of angles to 90 degrees, and there 
are more of those than you’d think. 

Consider, for example, the process of 
backface culling, discussed earlier in the 
context of using screenspace normals to 
determine polygon orientation relative to 
the viewer. The problem with that ap¬ 
proach is that it requires each polygon to 
be transformed into viewspace, then per¬ 
spective projected into screenspace, be¬ 
fore the test can be performed, and that 
involves a lot of time-consuming calcu¬ 
lation. Instead, we can perform culling 
way back in worldspace (or even earlier, 
in objectspace, if we transform the view¬ 
point into that frame of reference), given 
only a vertex and a normal for each poly¬ 
gon and a location for the viewer. 


Figure 7: Using the dot product to get
the distance from a point to a plane;
$\text{distance} = |(P - O_p) \cdot N_p|$.

Figure 8: Rotating to a new
coordinate space by projection onto
the new axes.




Here's the trick: Calculate the vector
from the viewpoint to any vertex in the
polygon, and take its dot product with the
polygon's normal, as in Figure 5. If the
polygon is facing the viewpoint, the result is negative, because the angle between
the two vectors is greater than 90 degrees.
If the polygon is facing away, the result
is positive, and if the polygon is edge-on,
the result is 0. That's all there is to it,
and this sort of backface culling happens
before any transformation or projection at
all is performed, saving a great deal of
work for the half of all polygons, on average, that are culled.
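
The trick is small enough to sketch in a few lines of C. The names `dot3` and `faces_viewer` are illustrative only; all three inputs are assumed to be in the same coordinate space (worldspace, or objectspace if the viewpoint has been transformed there).

```c
/* Dot product of two 3-D vectors stored as float[3]. */
static float dot3(const float *a, const float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Nonzero if the polygon faces the viewpoint. The normal does not
   need to be unit length, because only the sign of the result is used.
   An edge-on polygon (result exactly 0) is reported as not facing. */
static int faces_viewer(const float *viewpoint, const float *vertex,
                        const float *normal)
{
    float to_vertex[3];

    /* Vector from the viewpoint to any vertex of the polygon */
    to_vertex[0] = vertex[0] - viewpoint[0];
    to_vertex[1] = vertex[1] - viewpoint[1];
    to_vertex[2] = vertex[2] - viewpoint[2];

    return dot3(to_vertex, normal) < 0.0f;
}
```

A polygon five units down the z axis with a normal pointing back toward the origin passes the test; flip the normal and it is culled.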


Backface culling with the dot product
is just a special case of determining which
side of a plane any point (in this case, the
viewpoint) is on. The same trick can be
applied to determine whether a point is
in front of or behind a plane, where a
plane is described by any point that's on
the plane (which I'll call the “plane origin”), plus a plane normal. One such application is in clipping a line (such as a
polygon edge) to a plane. Just do a dot
product between the plane normal and
the vector from one line endpoint to the
plane origin, and repeat for the other line
endpoint. If the signs of the dot products
are the same, no clipping is needed; if
they differ, it is. And yes, the dot product
is also the way to do the actual clipping,
but before we can talk about that, we
need to understand the use of the dot
product for projection.
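
In C, the sign comparison might look like the sketch below. The function names are hypothetical. The code measures from the plane origin to each endpoint rather than the other way around; that flips both signs together, so the same-or-different comparison is unaffected. Treating an endpoint that lies exactly in the plane as needing no clipping is my assumption.

```c
/* Signed side-of-plane value: positive in front of the plane (the side
   the normal points toward), negative behind it, 0 in the plane. */
static float plane_side(const float *point, const float *planeorigin,
                        const float *planenormal)
{
    float v[3];

    v[0] = point[0] - planeorigin[0];
    v[1] = point[1] - planeorigin[1];
    v[2] = point[2] - planeorigin[2];

    return v[0] * planenormal[0] + v[1] * planenormal[1]
         + v[2] * planenormal[2];
}

/* Nonzero if the two endpoints lie strictly on opposite sides of the
   plane, meaning the line has to be clipped. */
static int needs_clipping(const float *linestart, const float *lineend,
                          const float *planeorigin, const float *planenormal)
{
    float s = plane_side(linestart, planeorigin, planenormal);
    float e = plane_side(lineend, planeorigin, planenormal);

    return (s < 0.0f && e > 0.0f) || (s > 0.0f && e < 0.0f);
}
```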

Using the Dot Product for Projection 

Consider Example 2(b) again, but this 
time making one of the vectors, say V, 
a unit vector. Now the equation reduces 
to Example 5(a). In other words, the re¬ 
sult is the cosine of the angle between 
the two vectors, scaled by the magni¬ 
tude of the nonunit vector. Now, con¬ 
sider that cosine is really just the length 


// Given two line endpoints, a point on a plane, and a unit normal
// for the plane, returns the point of intersection of the line
// and the plane in intersectpoint.
// Note: the line must not be parallel to the plane; in that case
// projectedlinelength is 0 and the division below is undefined.

#define DOT_PRODUCT(x,y) (x[0]*y[0]+x[1]*y[1]+x[2]*y[2])

void LineIntersectPlane (float *linestart, float *lineend,
   float *planeorigin, float *planenormal, float *intersectpoint)
{
   float vec1[3], projectedlinelength, startdistfromplane, scale;

   // Signed distance from the line start to the plane
   vec1[0] = linestart[0] - planeorigin[0];
   vec1[1] = linestart[1] - planeorigin[1];
   vec1[2] = linestart[2] - planeorigin[2];

   startdistfromplane = DOT_PRODUCT(vec1, planenormal);

   if (startdistfromplane == 0)
   {
      // point is in plane
      intersectpoint[0] = linestart[0];
      intersectpoint[1] = linestart[1];
      intersectpoint[2] = linestart[2];
      return;
   }

   // Length of the whole line segment along the plane normal
   vec1[0] = linestart[0] - lineend[0];
   vec1[1] = linestart[1] - lineend[1];
   vec1[2] = linestart[2] - lineend[2];

   projectedlinelength = DOT_PRODUCT(vec1, planenormal);

   // Fraction of the way from linestart to lineend at which
   // the line crosses the plane
   scale = startdistfromplane / projectedlinelength;

   intersectpoint[0] = linestart[0] - vec1[0] * scale;
   intersectpoint[1] = linestart[1] - vec1[1] * scale;
   intersectpoint[2] = linestart[2] - vec1[2] * scale;
}


Example 6: The intersection point on a line segment.
of the adjacent leg of a right triangle,
think of the nonunit vector as the hypotenuse of a right triangle, and remember that all sides of similar triangles
scale equally. What it all works out to is
that the value of the dot product of any
vector with a unit vector is the length of
the first vector projected onto the unit
vector, as in Figure 6.

This unlocks all sorts of neat stuff.
Want to know the distance from a point
to a plane? Just dot the vector from the
point $P$ to the plane origin $O_p$ with the
plane unit normal $N_p$, to project the vector onto the normal, then take the absolute value, as shown in Example 5(b)
and Figure 7.
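
Example 5(b) translates almost directly into C. This is only a sketch: `distance_to_plane` is a name I made up, and the result is a true distance only if the plane normal passed in really is a unit vector.

```c
/* distance = |(P - O_p) . N_p|, where unit_normal is assumed to be
   unit length. */
static float distance_to_plane(const float *p, const float *planeorigin,
                               const float *unit_normal)
{
    float v[3], d;

    v[0] = p[0] - planeorigin[0];
    v[1] = p[1] - planeorigin[1];
    v[2] = p[2] - planeorigin[2];

    /* Project the vector onto the normal */
    d = v[0] * unit_normal[0] + v[1] * unit_normal[1]
      + v[2] * unit_normal[2];

    /* Absolute value, done by hand to avoid pulling in math.h */
    return d < 0.0f ? -d : d;
}
```

Drop the absolute value and you get a signed distance, which is the same quantity the side-of-plane tests rely on.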


Want to clip a line to a plane? Calcu¬ 
late the distance from one endpoint to 
the plane, as just described, and dot the 
whole line segment with the plane nor¬ 
mal, to get the full length of the line along 
the plane normal. The ratio of the two 
dot products is then how far along the 
line from the endpoint the intersection
point is; just move along the line segment 
by that distance from the endpoint, and 
you’re at the intersection point, as shown 
in Example 6. 

Rotation by Projection 

You can use the dot product’s projection 
capability to look at rotation in an in¬ 
teresting way. Typically, rotations are 


represented by matrices. This is certainly a workable representation that encapsulates all aspects of transformation
in a single object, and it is ideal for concatenations of rotations and translations.
One problem with matrices, though, is
that many people, myself included, have
a hard time looking at a matrix of sines
and cosines and visualizing what's actually going on. So when two 3-D experts,
John Carmack and Billy Zelsnack, mentioned that they think of rotation differently, in a way that seemed more intuitive to me, I thought it was worth
passing on.

Their approach is this: Think of rotation as projecting coordinates onto new
axes. That is, given that you have points
in, say, worldspace, define the new coordinate space (viewspace, for example)
to which you want to rotate by a set of
three orthogonal unit vectors defining
the new axes, and then project each
point onto each of the three axes to get
the coordinates in the new coordinate
space, as shown for the 2-D case in Figure 8. In 3-D, this involves three dot
products per point, one to project the
point onto each axis. Translation can be
done separately from rotation by simple
addition.
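
To make that concrete, here is a minimal C sketch of rotation by projection. The function names are mine, and the three axis vectors are assumed to be orthogonal unit vectors expressed in the same space as the point being rotated.

```c
/* Dot product of two 3-D vectors stored as float[3]. */
static float dot3(const float *a, const float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Rotates point into the coordinate space whose axes are x_axis,
   y_axis, and z_axis: one dot product (projection) per new axis.
   out must not point at the same storage as point. */
static void rotate_by_projection(const float *point,
                                 const float *x_axis,
                                 const float *y_axis,
                                 const float *z_axis,
                                 float *out)
{
    out[0] = dot3(point, x_axis);
    out[1] = dot3(point, y_axis);
    out[2] = dot3(point, z_axis);
}
```

For a new frame turned 90 degrees about z, with x axis (0, 1, 0) and y axis (-1, 0, 0), the worldspace point (1, 0, 0) comes out as (0, -1, 0). Any translation would be a separate subtraction of the new origin before the projection.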

Rotation by projection is exactly the
same as rotation via matrix multiplication;
in fact, the rows of a rotation matrix are
the orthogonal unit vectors pointing along
the new axes. Rotation by projection buys
us no technical advantages, so that's not
what's important here; the key is that the
concept of rotation by projection, together with a separate translation step, gives
us a new way to look at transformation
that I, for one, find easier to visualize and
experiment with. A new frame of reference for how we think about 3-D frames
of reference, if you will.

Three things I've learned over the years
are that:

* It never hurts to learn a new way of
looking at things.

* It helps to have a clearer, more intuitive
model in your head of whatever it is
you're working on.

* New tools, or new ways to use old tools,
are Good Things.

My experience has been that rotation by
projection, and dot-product tricks in general, offer those sorts of benefits for 3-D.
Next time, we'll do BSP-based rendering, and if there's room, maybe I can
sneak in a sample app that shows some
smart dot tricks in action.


DDJ


Rocket Science
Made Simple 



I'm going to explain some important
stuff about computer architecture, stuff
that you really need to know. I'll cover the Pentium, PowerPC 601, and the
P6. We have to discuss a few basics before we come to the important stuff.

The term “computer architecture” is
widely misunderstood. It has little to do
with the design of a computer system or
microprocessor chip. The computer architect is best known as the person who
gets to use a clean piece of paper to define which instructions the computer will
be able to execute. But the most important job the architect does is decide on
the length(s), in bits, of the computer instructions and assign the bit fields within
that length to perform the necessary computer operations.

If there’s a large proportion of unused 
combinations, the architect has done a 
lousy job. But a few should be set aside. 
When Intel designed the 8086, some then- 
undefined combinations later became the 
basis for adding a (very) few more regis¬ 
ters in the 386 generation. 

The 8086 was designed back when 64K 
was a huge memory space and Pascal 
seemed to be taking over the personal- 
computer marketplace. So the 8086 was 
given exactly enough registers to run compiled Pascal.

Because memory was then an extremely 
limited resource, the 8086's basic instruc¬ 
tion-field length was made eight bits, and 
some of Pascal’s most common instruc¬ 
tions (LOOP, for example) were fitted into 
those eight bits; eight bits does not pro¬ 
vide for specifying a lot of registers. 

When the 68000 was designed, larger 
memories were common, so the architect 
selected a 16-bit basic instruction field.
Two 4-bit register fields were assigned. 
Eight bits, half the 16-bit instruction field, 
went to defining the source and destina¬ 
tion registers. 


Hal is a hardware engineer who some¬ 
times programs. He is the former editor 
of DTACK Grounded and can be con¬ 
tacted through the DDJ offices. 


Hal W. Hardenbergh 


But the ability to place many transistors
on a single die was exploding, and soon
sets of 32 registers, each 32 bits wide, started showing
up, for instance on David Patterson's
Berkeley RISC I design. Five bits are required to select one of 32 registers. If two-address (SRC, DEST) operands were to
be used, then ten bits of the instruction
bit field were needed to specify the registers. That leaves only six bits of a 16-bit
instruction field, not enough to be useful.

So computers with 32 registers moved
up to a 32-bit instruction field. All the
computer architects made the decision to
use three-address operands (SRC1, SRC2,
DEST) and so assigned 15 bits just for register selection: again, about half the instruction field.
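
The bit budgeting is simple enough to check in a few lines of C. This is only a sketch of the arithmetic; the helper names are hypothetical and do not describe any real instruction encoding.

```c
/* Number of bits needed to select one of `count` registers. */
static int bits_to_select(int count)
{
    int bits = 0;

    while ((1 << bits) < count)
        bits++;
    return bits;
}

/* Bits left in an instruction field once every register operand
   has been given its own selector field. */
static int bits_left(int field_bits, int registers, int operands)
{
    return field_bits - operands * bits_to_select(registers);
}
```

`bits_left(16, 16, 2)` gives 8 (the 68000 case), `bits_left(16, 32, 2)` gives the unusable 6, and `bits_left(32, 32, 3)` gives 17, with 15 bits spent on register selection.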

The microprocessor went from a 
register-starved, 8-bit instruction field in 
1977 to a register-rich, 32-bit instruction 
field in 1982. These architectural deci¬ 
sions were dictated by the then state of 
the chip-fabrication art. Let me repeat:
these were architectural decisions.

And architectural innovations stopped
right there in 1982, because a personal computer does not (yet) need a 64-bit instruction field. Yep, architecture for personal
computers essentially froze in 1982.

How do you upgrade a computer to a 
new architecture? In other words, how do 
you get your hands on more registers 
while continuing to run your old software?
The answer is, you don’t. The only way 
to get more registers is to abandon your 
software —all your software— and move 
to a new computer. I understand the MIPS- 
based ACE computer systems (which run 
both UNIX and Windows NT) are partic¬ 
ularly good examples of desktop com¬ 
puters with register-rich environments. 

Oh? You don't have an ACE system on
your desktop? You still use, and program
for, a register-starved computer architecture? Gee. It appears that computer architecture, while fundamental, is not important.


The personal-computer marketplace 
doesn’t care about architectural hard¬ 
ware issues. The marketplace responds 
to fast and cheap. “Fast” means internal 
caches, floating-point accelerators, su¬ 
perscalar techniques, and the like — 
none of which has anything to do with 
architecture. (The presence or absence 
of an internal cache is independent of 
the instruction field.) 

"Cheap" means economy of scale. 
More than 50 million personal comput¬ 
ers will be sold this year, and to a first- 
order approximation, 100 percent of them 
will be based on the x86 architecture. If 
you want a cheap computer, buy one 
based on the x86. 

But the marketplace still wants to run 
the software it acquired ten years ago.
Software compatibility is, in fact, an ar¬ 
chitectural issue, and it matters in the mar¬ 
ketplace. 

The people who designed the Pentium
and the P6 and who are currently designing the P7 are not computer architects. But they're pretty good engineers,
based on the results I've seen. I call them
“chip designers.”

Back when the world was young and
children were respectful of their elders, the
chip designer's job was simple: The design
had to execute any instruction as quickly
as possible. Then it had to execute the next
instruction as quickly as possible. That's
how the 8086, 286, 386, and 486 work.

But with the advent of the Pentium, 
those days are gone. The Pentium — 
sometimes—executes more than one in¬ 
struction in the same clock cycle. That 
“sometimes" is pretty important to those 
of you who need to write code that runs 
fast, and has afforded my colleague 
Michael Abrash the opportunity to publish several articles on optimizing code for
the Pentium.

The Pentium is the first x86 generation
that uses a “superscalar” implementation.
Let's compare it to the PowerPC 601, which
was primarily designed by IBM, with a little bus-interface assistance from Motorola. To a first-order approximation, the 60x
architecture has 0 percent of the personal-computer market.

The 601 is based on the latest computer
architecture: the 32-bit model with 32 registers. Like the Pentium, its implementation uses superscalar techniques, but not
those used by the Pentium. The 601 can
issue up to three instructions each clock
cycle, one each of integer, floating point
(fp), and branch.

You are the software experts, not me, so
let's pretend you just explained to me that
most application programs in the personal-computer market execute instructions in
the ratio 85 percent integer, 0 percent fp,
and 15 percent branch. This means the
601's ability to simultaneously execute fp
instructions with integer and branch instructions is useless. The only improvement the superscalar 601 offers is the ability to simultaneously issue integer and
branch instructions. And since there are
roughly six times as many integer as
branch instructions, this isn't terribly useful. In fact, the 601's superscalar ability
means that, at best, it can execute 100 instructions in 85 clocks (assuming one clock
per instruction). All that superscalar design effort provides, at best, a 17.6-percent
performance improvement.

The Pentium's designers were much
more crude. If either an fp or branch instruction is issued on a given clock cycle,
then no other instruction can be issued at
that time. In practice, this means that during the 15 percent of the time that branch
instructions are being issued, the Pentium
ain't superscalar. But in the 85 percent of
the time that integer instructions are being
issued, the Pentium can (sometimes) issue two integer instructions on the same
clock cycle. This means the Pentium can,
at best, execute a 100-instruction mix (assuming one clock per instruction cycle) in
85/2+15=57.5 clocks, a 73.9 percent performance improvement.
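
The best-case arithmetic for both chips can be written out as a short C sketch. It models only the simplified issue rules described so far (the article itself qualifies the Pentium pairing rules a little further on), and every function name here is made up for illustration.

```c
/* Best case for the 601 as described: one integer, one fp, and one
   branch instruction can issue together, so the most numerous class
   sets the clock count. */
static double best_clocks_601(double integer, double fp, double branch)
{
    double clocks = integer;

    if (fp > clocks)
        clocks = fp;
    if (branch > clocks)
        clocks = branch;
    return clocks;
}

/* Best case for the Pentium under the simplified rules: integer
   instructions pair up, while fp and branch instructions issue alone. */
static double best_clocks_pentium(double integer, double fp, double branch)
{
    return integer / 2.0 + fp + branch;
}

/* Percentage improvement over executing one instruction per clock. */
static double improvement_percent(double baseline_clocks, double clocks)
{
    return (baseline_clocks / clocks - 1.0) * 100.0;
}
```

For the 85/0/15 mix, the 601 needs 85 clocks and the Pentium 57.5; dividing the 100-clock baseline by each gives improvements of roughly 17.6 and 73.9 percent.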

Okay, instructions sometimes need
more than a single clock to execute, and
the Pentium cannot always issue two integer instructions in the same clock period, thus Abrash's fine articles on optimization. But Intel's chip designers
focused on improving performance during the 85 percent of the time that integer instructions are being issued, while
IBM's designers concentrated their efforts
on the 15 percent of the time that branch
instructions were being issued.

Which design team best earned its paycheck?

I sent a copy of the penultimate draft of
this article to some folks who used to design microprocessor chips for a living. One
of them, John Wharton, called me back
and said, “Hal, the Pentium doesn't work
like that!” (The last four digits of John's
home phone number are 8051, which is
one of Intel's most popular 8-bit micros.)

So I was wrong. A Pentium can issue a
branch instruction after an integer instruction in the same clock (but not an integer instruction after a branch instruction). And under rare circumstances the
Pentium can issue two FP instructions in
the same clock, if one of them is an
FXCH instruction.

In the pairing rules, a “complex” in¬ 
struction is a microprogrammed instruc¬ 
tion, such as one of the string instructions 
(MOVS or SCAS, for example). When one 
of the integer pipes goes into micropro¬ 
grammed mode, both pipes do. That's 
why only one “complex” instruction can 
be active at a time. 

John also explained floating-point pro¬ 
cessing: 

A cute trick the Pentium designers came 
up with was getting the result of a 64-bit 
FP operation back to the internal cache 
quickly. FP operations use the integer pipes, 
each of which is 32 bits wide. So the Pen¬ 
tium uses both pipes to move 64 bits in 
parallel. It saves one clock and at Pentium
speeds, one clock is important.

(The most interesting thing John told
me was about the infighting (I call it civil war) over Intel's upcoming P7. But
that's another story.)

The Pentium design team set up two 
on-chip production lines, like Ford us¬ 
ing one line for Escorts and another for 
Taurii. With a budget of 5.5 million tran¬ 
sistors, the P6 design team was able to 
use more advanced techniques. Continuing with the automotive analogy, the P6
makes intensive efforts to build a car in 
the shortest time. 

In the P6, we find a large crowd gathered at the input ends of several parallel
production lines (pipes), and another
large crowd at the output ends.

The input crowd looks for tasks ready
to proceed and issues them to one of the
production lines. It also looks for tasks
that might be ready to proceed and speculatively issues them, too. A list of 30 tasks
to select from is kept.

The crowd at the output accepts and
temporarily stores all the results the several production lines deliver. Not everything that comes off the production lines
proves to be useful. Some “product” is ultimately discarded. (“We can't use that
blue trunk assembly on this red car, Fred.
Throw it away!”)

A scoreboard keeps track of everything
that's going on. The P6 has a lot more
registers than the programmer's model asserts, and renames them for efficiency.
How did Intel's designers get so smart?
They probably read Chaitin et al.'s tutorial, “Register Allocation via Coloring,”
which is part of the June 1982 SIGPLAN
Proceedings on compiler construction.
Yes, tutorial. In 1982. You didn't think this
stuff was new, did you?

[Abstract: Register allocation may be viewed 
as a graph coloring problem ... Preliminary 
results .. suggest that global register alloca¬ 
tion approaching that of hand-coded as¬ 
sembly language may be attainable] 

Now you should have a grasp of what 
Intel means when it says the P6 uses score- 
boarding techniques and issues instruc¬ 
tions speculatively. Specifically, the P6 
guesses which branch paths will be tak¬ 
en and speculatively executes the instruc¬ 
tions following those branches (assuming 
no data dependencies). If those branches 
are taken, then the instruction results are 
already available. Otherwise, the results 
are discarded. The P6 speculatively executes instructions past up to five (!)
branches, assuming they're available in the
30-instruction queue at the front end.

The P6 is Intel’s first x86 that does not 
always directly execute x86 instructions. 
If you've read Abrash's articles on Pentium
optimization, you know the performance
benefits of breaking some complex instructions down into two simpler, yet
equivalent, x86 instructions. Well, the P6
takes this a step further. The P6's instruction decoder will often break a complex
x86 instruction into simpler instructions
that may not be x86 instructions at all.

Since P6 continually looks at the next 
30 instructions and begins execution of 
each as soon as possible, and automati¬ 
cally breaks up complex instructions 
when beneficial, you won’t have to opti¬ 
mize P6 code. 

The P6 self-optimizes all that shrink- 
wrapped code, no matter what genera¬ 
tion of optimizing compiler was used. 
Poor Michael Abrash! He'll have nothing
to write about, and the bank will fore¬ 
close his mortgage. 

The philosophical design differences
underlying the 486, Pentium, and P6 generations have nothing whatever to do with
computer architecture and everything to
do with chip design. The best chips are 
designed by persons familiar with hap¬ 
penings in the mainframe and minicom¬ 
puter arenas a dozen or more years back. 

Intel’s Andrew Grove once publicly 
asserted that there wasn't any use for a 
million-transistor-plus chip except for memory. If he'd known his x86 chip designers
would soon be crafting microprocessors 
that performed useless instructions and 
wouldn’t even directly execute x86 code, 
do you suppose he’d have fired them? 

DDJ 
SOFTWARE AND THE LAW


Patents: Best Protection 
for Software Today? 



Patents and software used to be words
that were not spoken of together.
Today, that's changing, and fast.
Recent court decisions have expanded the availability of patents for
software. Furthermore, the U.S. Patent
Office has just gotten in line by issuing
new proposed guidelines for the examination of “computer-implemented” inventions. The procedures for enforcing
software patents are also being trimmed
down to make them less expensive and
faster.

At the same time, other forms of legal
protection for software are becoming less
attractive. Copyrights provide only limited protection. This is known only too well
to Lotus, which recently was told by an
appellate court that the copyrights on its
famous 1-2-3 spreadsheet program do not
bar Borland from copying the entire 1-2-3
menu tree. Trade-secret protection is also
often lost inadvertently, particularly with
publicly distributed software.

Unless you are planning to develop
your software in a cave during the next
decade, therefore, you need to be able to
determine when software can be patented, how to patent it, and how to deal with
patent-infringement allegations. Let's begin by taking a look at three tests for
patentability.

Statutory Subject Matter 

I usually like to leave out the legal jar¬ 
gon in my column. But if you want to 
be knowledgeable in this area, remember the phrase “statutory subject matter.”
This concept addresses whether the “sub¬ 
ject matter” of the invention is listed in 
the “statute” (35 U.S.C. §101) that defines 
the types of inventions entitled to a 


Marc is a patent attorney and shareholder of the intellectual-property law firm of
patent. It is this requirement that has developed into an impediment to software
patents.

This statute states that a patent may be
obtained on “any new and useful process,
machine, manufacture, or composition of
matter, or any new and useful improvement thereof....” Software clearly falls
within at least one of these broad areas.
Nevertheless, the U.S. Supreme Court refused a patent on a software-driven process for converting a BCD number into
pure binary. In its 1972 decision of
Gottschalk v. Benson, the Court said that
a patent could not be obtained on “laws
of nature, physical phenomena and abstract ideas.”

Again in 1978, the Supreme Court refused to issue a software-related patent,
this time on a method of updating numerical alarm limits using a computer. In
Parker v. Flook, the Court expressed concern that such a patent consisted merely
of a new formula and the computer that
implemented it.

But in 1981, the Supreme Court changed 
tacks. In Diamond v. Diehr, the Court approved a patent on a process for curing
rubber that implemented a well-known 
mathematical equation (the Arrhenius 
equation) in a computer to calculate op¬ 
timum cure time. Although the Court re¬ 
iterated that laws of nature, natural phe¬ 
nomena, and abstract ideas are not 
patentable, it said that a patent could be 
granted on a practical application of a con¬ 
cept, even if it included a programmed, 
digital computer. 

At about the same time, Congress cre¬ 
ated a new court to handle all appeals in 
patent cases. It is called the “United States 
Court of Appeals for the Federal Circuit.” 
In a series of recent decisions, the Feder¬ 
al Circuit has made clear that patents may 
be granted on a broad variety of inven¬ 
tions containing software. 


Perhaps most important is its 1994 de¬ 
cision in In Re Alappat. Alappat involved 
a software program that implemented a 
series of algorithms to clarify the picture 
that is displayed on an oscilloscope. In 
support of its conclusion that such an in¬ 
vention was “statutory,” the Federal Cir¬ 
cuit stated that the invention was “not a 
disembodied mathematical concept..., but 
rather a specific machine to produce a 
concrete and tangible result.” The court 
noted that “a general-purpose computer
in effect becomes a special purpose com¬ 
puter once it is programmed to perform 
particular functions pursuant to instruc¬ 
tions from program software....” 

About two weeks later, in In Re Warmerdam,
the Federal Circuit ruled that a patent
could be issued in connection with an al¬ 
gorithm that created a data structure that 
controlled moving objects, such as robots, 
so that they would not collide with other 
objects. 

On July 31, 1995, the U.S. Patent and 
Trademark Office followed suit by issu¬ 
ing proposed guidelines for “computer- 
implemented” inventions. “These guide¬ 
lines respond to recent changes in the 
law that govern the patentability of 
computer-implemented inventions,” the 
Office said. The proposed guidelines es¬ 
sentially provided that all software-relat¬ 
ed inventions constitute patentable sub¬ 
ject matter, except when the invention 
falls within one of the following categories: 

• A compilation or arrangement of data,
independent of any physical elements.

• A known, machine-readable storage
medium encoded with data representing
creative or artistic expression (for example, a work of music, art, or literature).

• A “data structure” independent of any
physical element (that is, not as implemented on a physical component of a
computer such as a computer-readable
memory to render that component capable of causing a computer to operate
in a particular manner).

• A process that does nothing more than
manipulate abstract ideas or concepts
(for example, a process consisting solely of the steps one would follow in solving a mathematical problem).

Significantly, none of the excluded cat¬ 
egories appear to embrace a general- 
purpose computer running a software pro¬ 
gram. Does this mean that all software can 
be patented by simply claiming a general- 
purpose computer running the new soft¬ 
ware program? The proposed guidelines 
appear to address this question in the fol¬ 
lowing way: 

[I]n rare situations, a claim classified as a
statutory machine or article of manufacture
may define nonstatutory subject matter.
Nonstatutory subject matter (i.e., abstract ideas,
laws of nature, and natural phenomena)
does not become statutory merely through
a different form of claim presentation.

Such a claim will (a) define the “invention”
not through characteristics of the machine
or article of manufacture claimed but
exclusively in terms of a nonstatutory process
that is to be performed on or using
that machine or article of manufacture, and
(b) encompass any product in the stated
class (e.g., computer, computer-readable
memory) configured in any manner to perform
that process.

To avoid this exclusion, it seems as 
though the invention must be defined 
“through characteristics of the machine or 
article of manufacture claimed.” But what 
does this mean? Is this different from the 
second requirement that the invention not 
“encompass any product in the stated 
class?” I really don’t know! But I do know
that the guidelines are intended to con¬ 
form the practices of the Patent Office 
with the more liberal views of the courts. 
If the guidelines are interpreted to pre¬ 
clude patents on all general-purpose com¬ 
puters running new software, that goal 
seemingly will not be reached. 

These guidelines may already have been 
clarified by the time you read this column. 
They were issued in June of this year for
“public comment.” Final guidelines were
promised for July 31, 1995. Check my next 
column for an update. 

Novelty 

The second requirement for a patent is 
that the invention be “novel.” This is usually
a very easy test to pass. It simply
means that the invention is in some way 
new. Any distinction whatsoever from 
what was done before is sufficient. 

An invention’s “novelty” is usually lost 
in the United States if an application for 
patent is not filed within one year after 
certain activity has begun. (Many foreign 
countries have no such grace period.) 



Three major types of activity usually
start the one-year clock:

* When a product embodying the invention
is offered for sale by anyone, even
someone other than the inventor. The
offer need not result in an actual sale.

* When the invention is described in a
“printed publication.” There is no requirement
that the publication be widely
distributed. Indeed, papers distributed
at conferences are usually
sufficient.

* When the invention is used for its intended
purpose in a nonexperimental
environment. If you invent a superior
database system and allow your spouse
to use it to manage the groceries, you
had better start the one-year clock. Even
your own use of the invention can start
the clock running. The clock will usually
not begin while the primary purpose
of the use is to determine whether
the invention works.

Nonobviousness 

The third requirement for obtaining a
patent is that the invention not be “obvious”
in view of what has been done before.
This is the only qualitative test that
is applied.

“Nonobviousness” does not mean that
the software has to be great or achieve a
remarkable result. It merely means that
the underlying concepts of the software
are not obvious in view of what was
known before.

Most software is simply combinations
of previously known routines. In determining
“obviousness,” therefore, the real
question is whether it would have been 
obvious to have combined these routines 
to make the software. 

The determination of “obviousness” is 
necessarily subjective. But several objec¬ 
tive factors will be considered: 


* The art taught away from the approach
that the software took.

* Widespread efforts to obtain the benefits
of the software were previously
unsuccessful.

* The software has received widespread
recognition.

* The software achieves new and unexpected
results.

* The software has been a commercial
success.

* Nothing in the prior art suggests combining
the routines contained in the
software.

The Patent-Application Process 

The first step in patenting your software 
is to document it by preparing block di¬ 
agrams, flowcharts, specifications, data- 
structure maps, screen layouts, and the 
like. While source code is obviously the
most precise formulation of the software,
it is usually not perfected until
well after the software is conceived. 
Also, it does not usually communicate 
the broad concepts implemented by the 
software. 

It’s wise to corroborate the date on
which the documentation was prepared.
At the very least, it should be signed and
dated by the developer. It is also good
practice to have people not involved with
the development read, date, and sign the
documentation. Above each signature, the
document should state that the witness
has read and understood the information,
as well as the number of pages it contains.
It is also a good idea to keep a
bound invention notebook and to date
each new entry.

A patentability search is usually per¬ 
formed next. Although not required by 
law, it is very useful, as it may reveal that 
the software is not sufficiently different 
from previous work to justify the expense 
of a patent. Even when you are sure about 
the distinctiveness of your new software,
knowledge of the closest prior art will help 
to frame the patent application in the 
broadest possible way. 

The next step is to prepare the application.
A patent application usually contains
drawings, a written description of
the invention, and a set of “claims” (En¬ 
glish descriptions of the invention's ele¬ 
ments). You are not required to build a 
working model of your invention before 
filing the application. 

Unless you have considerable experi¬ 
ence with patents, it is unlikely that you 
will be able to prepare the application 
yourself. Normally, applications are pre¬ 
pared by a patent attorney or patent agent, 
both of whom must first pass a proficiency 
test. But, you can do certain things to as¬ 
sist in its preparation: 
* You are legally bound to provide the attorney
or agent with copies or a description
of the closest prior art of which
you are aware.

* Identify the specific differences—steps, 
components, results—between your 
software and the prior art. 

* Disclose the “best mode” you are aware 
of for implementing your invention. If 
you don't disclose certain features with 
the thought of keeping them secret and 
this is later discovered, your patent will 
be declared invalid. For marketed soft¬ 
ware this is usually easy to prove. To 
easily satisfy the “best mode” require¬ 
ment, provide the Patent Office with a 
complete copy of your source code. If 
this makes you feel uneasy, a patent 
may not be for you! 

* Include sufficient information to enable 
a person of ordinary skill in the art to
which your invention pertains to make 
and use the invention without undue 
experimentation. Submitting the source 
code will often fulfill this requirement, 
too. The more detail, the better. The 
only downside is the cost of docu¬ 
menting such detail. 

After your application is filed, it will be 
assigned to a Patent Office “examiner"— 
a person knowledgeable in the field of 
your invention and in the principles of 
patent law. 

Approximately six months to a year af¬ 
ter the application is filed, you will receive 
an “office action,” a written response to 
your application from the Patent Office 
examiner. The office action will either al¬ 
low your application or explain why it is 
being rejected. 

Rejected applications may be amended
to try to overcome the grounds of rejection.
Alternatively, you can argue that
the rejection is unjustified.

A second rejection is usually final. You
will usually have to pay an additional fee
to submit further amendments or arguments.
You can also appeal the examiner's
final rejection to the Board of Patent
Appeals and, as thereafter necessary, to
the federal courts.

Enforcement 

The first step in enforcing a patent is to
determine whether the software of the accused
party is infringing.

Those unfamiliar with patent law usual¬ 
ly don't make this determination correctly. 
Inventors often feel that there is an infringement
when the competing software
incorporates features described in the patent.
Accused infringers, on the other hand, often
conclude that there is no infringement
because their software does not contain every
feature described in the patent.


Neither approach is correct. The draw¬ 
ings and detailed descriptions in a patent 
are usually merely examples of the in¬ 
vention, not the invention itself. Not uti¬ 
lizing every feature in an example does 
not necessarily avoid infringement. Con¬ 
versely, using a few of the features does 
not necessarily imply infringement. 

The true test of infringement is whether 
the software in question contains every 
feature documented in any single claim at 
the end of the patent. This is the rule: A 
patent is infringed when the software in
question contains every element recited 
in any single patent claim. 

When determining the scope of each 
element, the words describing it should 
be given their ordinary meaning in the art, 
except when a contrary meaning is ex¬ 
pressed in the patent. The words should 
also be given their broadest reasonable 
meaning, not restricted to the specific examples
described in the patent.

In three circumstances, an infringement
will be found even if the infringer's software
does not contain all of the claim’s
elements:

* The software contains the equivalent of 
each missing element, usually a cor¬ 
responding element that performs sub¬ 
stantially the same function in sub¬ 
stantially the same way, to achieve 
substantially the same result. The precise 
reach of this “doctrine of equivalents" is 
expected to be the subject of a decision 
by the Federal Circuit in In Re Hilton. 

* The missing elements are found in the
computer system in which the infringer's
software is installed. If the software has
no substantial use other than in a system
that infringes the patent claim, the
person making or selling the software
will usually be liable as a “contributory
infringer.”

* The accused infringer did not actually 
commit the infringement, but encour¬ 
aged the person who did by distribut¬ 
ing promotional material or product 
manuals that promote the software as 
useful in a configuration that infringes 
the patent claim. This is known as “inducing
infringement.” An officer or employee
of a company who actively participates
in infringing activity of that
company can also be held personally
liable under this theory.

Charges of infringement should be 
made carefully, as they give the alleged 
infringer the right to sue the person charg¬ 
ing infringement for a “declaratory judg¬ 
ment” that the patent is not infringed or 
is not enforceable. Unless defended at 
typically great expense, the patent could 
be lost. 


It is particularly risky to charge customers
of a manufacturer with infringement.
If the claim turns out to be unmeritorious,
the person charging infringement
can be exposed to counterclaims for libel,
slander, disparagement, interference with
contract, and violation of the antitrust laws.

Responding to a charge of patent infringement
requires even greater care. All
too often, the accused infringer denies the
infringement allegation without having the
allegation analyzed by a competent patent
attorney. Be warned: A company that continues
infringing activity without having
first received a favorable legal opinion will
often be assessed treble damages and attorney's
fees if it loses the case.

Patent litigation has traditionally been
very expensive, but this may also be
changing. The Federal Circuit just held in
Markman v. Westview Instruments that disputes
over the scope of a patent should
be determined by a judge, not a jury.
Thus, many cases will now be resolved
far short of an expensive jury trial.

The right to a jury trial in patent cases 
is also now being questioned. In Ameri¬ 
can Airlines v. Lockwood, the Supreme 
Court has agreed to decide whether the 
U.S. Constitution gives an alleged infringer
a right to a jury trial. Abolishing jury tri¬ 
als in patent cases entirely would result 
in additional savings. 

In many cases, the alleged infringer is 
aware of prior art that is closer to the in¬ 
vention than that which the Patent Office 
knew about when it was issued. He may 
then challenge the validity of the patent, 
arguing that the invention was “obvious.” 
Although this can be done in court, it
also can often be done in a separate Petition
for Reexamination in the Patent Office.
Seeking reexamination of the patent
in the Patent Office is far less expensive
than court litigation. Unfortunately, the
alleged infringer is usually not permitted
to participate during reexamination.
Therefore, alleged infringers who have
the financial resources often opt to have
their invalidity allegation determined by
a court.

Conclusion 

Considerable dispute continues over the
type of software entitled to a patent. While
some are arguing, others are applying for
and receiving software patents.

Don't be left behind! And remember, 
software can be simultaneously protected 
by a patent and a copyright. Indeed, un¬ 
til the patent is granted (typically, not un¬ 
til at least a year after the application is 
filed), the software can also be protected 
as a trade secret. 

DDJ 
Observations on Observer


PATTERNS & SOFTWARE DESIGN 


Partitioning a system into objects is
a key activity during object-oriented
design. As a result of this partitioning,
we may create objects that depend
on other objects. Changes in one
object must be reflected in the others. There
are many different ways to ensure that
these dependencies are maintained.

For example, consider the timer object
in Figure 1, which keeps the current time,
and a digital-display object that shows the
current time. Whenever the timer ticks,
this time-display object has to be updated.
In other words, the time-display object
has to maintain the constraint to always
reflect the timer's current time. A
simple solution is to connect the timer object
directly to the time-display object.
Whenever the timer changes, it explicitly
tells the display object to update itself. Figure 2
shows the corresponding class diagram
for directly coupling the timer with
its observer. Listing One (listings begin on
page 63) is one way to implement this in
C++. While this direct coupling of two objects
is simple to implement, it can also
introduce problems in different areas:

* Reusability. It is not possible to reuse the
timer independently of the time display.
The two objects are strongly coupled and
must always be used together, even when
the client is only interested in the timer.

* Maintainability. The direct coupling
makes maintenance more difficult. It is
not possible to test or port the timer to
a different platform independently of
the time display.

* Extensibility. Whenever you want to add
another kind of timer display (say, an
analog display) that needs to be synchronized
with the timer, you have to
modify the Timer class to also update
this new kind of time display.
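
The direct coupling behind these three problems can be sketched in a few lines of C++. This is not the article's Listing One, which is not reproduced here; the member names (Update, Text, CurrentTime) and the choice to hand the time to the display as a parameter are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical display class; a real one would draw into a window.
class DigitalTimeDisplay {
public:
    // Called directly by the timer; stores the text that would be drawn.
    void Update(long seconds) {
        _text = std::to_string(seconds / 3600) + ":" +
                std::to_string((seconds / 60) % 60) + ":" +
                std::to_string(seconds % 60);
    }
    const std::string& Text() const { return _text; }
private:
    std::string _text;
};

class Timer {
public:
    // The timer names the concrete display class: this is the hard dependency.
    explicit Timer(DigitalTimeDisplay* display) : _display(display) {
        assert(display != nullptr);
    }
    void Tick() {
        ++_seconds;
        _display->Update(_seconds);  // explicit, direct update of the one display
    }
    long CurrentTime() const { return _seconds; }
private:
    long _seconds = 0;
    DigitalTimeDisplay* _display;
};
```

In this sketch Timer cannot be compiled, tested, or reused without DigitalTimeDisplay, and adding an analog display would mean editing Timer itself.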

These problems are clearly not serious 
in this simple example. However, in the 
context of a larger system we have to be 
more alert when introducing dependen¬ 
cies among objects. In fact, it is a key de¬ 
sign activity to control and manage the 
dependencies between objects. Badly 
managed dependencies can result in a tan¬ 
gled system that is hard to reuse, main¬ 
tain, or extend. A common theme of sev¬ 
eral patterns in our book Design Patterns: 
Elements of Reusable Object-Oriented Soft¬ 
ware is how to break hard dependencies 
by decoupling the involved objects. The 
Observer pattern is one of them. 

The intent of the Observer pattern is to 
define dependency relationships between 
objects so that when one changes, its de¬ 
pendents are notified and can update them¬ 
selves accordingly. The Observer pattern 
enables objects to observe and stay syn¬ 
chronized with another object without cou¬ 
pling the observed object with its observers. 
The pattern has two participants: (1) a subject,
and (2) the subject's dependent observers.
Each time the subject changes, it
is responsible for notifying its observers
that it changed. Observers must ensure that
whenever they are notified, they in turn 
make themselves consistent with their sub¬ 
ject. A subject needs an interface that al¬ 
lows observers to subscribe and register 
their interest in changes to the subject. 

The subject usually maintains a list of 
subscribed observers. Figure 3 illustrates 
these class relationships in OMT notation. 
Notice that we introduced two new base 
classes. The Subject class defines the 
mechanism for registering and notifying 
observers and the Observer class defines 
the update interface. This diagram illustrates
how the Observer pattern breaks
the direct coupling between Subject and
Observer. The Subject knows nothing
about its Observers except that they can
be sent Update requests. This is because
the reference from Subject points to the
abstract class Observer. For this reason,
we refer to this kind of coupling as “abstract.”
The abstract coupling between Subject
and Observer resolves the problems
mentioned previously:

* Reusability. The timer object can now
be reused and distributed without the
time-display object. It only has to be
bundled with the abstract Observer
class. Figure 4 illustrates the resulting
class coupling.

* Maintainability. The objects are no longer
directly coupled, and the timer object
can be tested independently of the time
display. For example, when you port
the Timer class to another platform, you
can test it as soon as you’ve ported it
and Observer. You no longer have to
wait for the testing until the time display
and its associated graphical infrastructure
are ported as well.

* Extensibility. It is easy to add additional
objects that need to be synchronized
with the timer. For example, an analog
time display only needs to inherit from
Observer, implement the Update interface,
and register itself with the timer.

Figure 4 shows that the concrete ob¬ 
server knows the class of the subject it is 
observing, and it can rely on this interface 
to query the subject’s current state.

The Subject and Observer classes (List¬ 
ings Two and Three, respectively) illus¬ 
trate how you could implement Observer 
in the context of the timer example. The 
key point about Timer is that its Tick
member function calls Notify, which will
call Update on all its Observers.

Listing Four presents the classes Observer
and DigitalTimeDisplay. DigitalTimeDisplay
maintains a reference to the
timer. Whenever the Timer ticks, it calls
Notify, which in turn calls Update on its
attached Observers. In this case, DigitalTimeDisplay
receives the Update request,
reads the time from the timer, and displays
the time.

Notice how the Timer has no knowledge
of how it is displayed. In fact, you
could add another timer display, say an
AnalogTimeDisplay, and it would also be updated
whenever the Timer ticked.
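
The structure just described can be sketched as follows. This is a minimal illustration rather than the article's Listings Two through Four, which are not reproduced here; the Attach and Detach names, the use of std::vector, and the Shown accessor are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update() = 0;  // the whole update interface
};

class Subject {
public:
    virtual ~Subject() = default;
    void Attach(Observer* o) {
        assert(o != nullptr);
        _observers.push_back(o);
    }
    void Detach(Observer* o) {
        _observers.erase(std::remove(_observers.begin(), _observers.end(), o),
                         _observers.end());
    }
    void Notify() {
        // Iterate over a copy so an observer may detach itself during Update.
        std::vector<Observer*> snapshot = _observers;
        for (Observer* o : snapshot) o->Update();
    }
private:
    std::vector<Observer*> _observers;  // only the abstract type is known here
};

class Timer : public Subject {
public:
    void Tick() { ++_seconds; Notify(); }
    long CurrentTime() const { return _seconds; }
private:
    long _seconds = 0;
};

class DigitalTimeDisplay : public Observer {
public:
    explicit DigitalTimeDisplay(Timer* t) : _timer(t) { _timer->Attach(this); }
    ~DigitalTimeDisplay() override { _timer->Detach(this); }
    // Pull: read the time from the timer when told that something changed.
    void Update() override { _shown = _timer->CurrentTime(); }
    long Shown() const { return _shown; }
private:
    Timer* _timer;
    long _shown = 0;
};
```

Timer mentions only Subject, and Subject mentions only Observer, so any number of display classes can be added without touching either.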

Absorbing the Subject and 
Observer Classes 

One possible simplification of the canonical
observer-class structure is to absorb
the Subject and Observer classes into existing
classes. For example, the Microsoft
Foundation Classes (MFC) use this kind
of simplification. MFC supports multiple
views observing a document (see “Adding
Auxiliary Views for Windows Apps,” by
Robert Rosenberg, Dr. Dobb's Sourcebook
of Windows Programming, March/April
1995). In MFC, the subject functionality is
absorbed into Document, and Observer is
absorbed into the View class. This solution
is simpler since it requires fewer classes,
but the dependency relationships can
only be defined between instances of Document
and View.
Figure 1: Simple timer object.


Inheritance Variations 

There are also several variations in how 
inheritance is used to implement the Ob¬ 
server pattern. As Figure 2 shows, the in¬ 
terfaces for notifying and observing are 
defined by two classes. The Observer base 
class defines an interface consisting of an 
Update operation. In a language support¬ 
ing multiple inheritance, Observer is often 
not a primary base class (that is, it is mixed 
in as an auxiliary base class). For exam¬ 
ple, in the timer example, AnalogTime- 
Disptay might need to inherit from a graph¬ 
ical base class like View. In this case, 
A nalog TimeDisplay mixes in the Observer 
class as an auxiliary class. Another varia¬ 
tion is not to separate the notifying and 
observing interfaces into two separate 
classes. For example, in the Smalltalk-80 
implementation of Observer, these two in¬ 
terfaces are supported by the universal Ob¬ 
ject class. Thus, each object in the system 
can act as both subject and observer. This 
is particularly convenient in a language 
that does not support multiple inheritance 
or in class libraries that don't want to rely
on it. Using separate classes for subject
and observer would require you to inherit
from both subject and observer when an
object needs to act as both. 
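
The mix-in variation can be sketched in C++ as below. The View class and its Draw operation are hypothetical stand-ins for a graphical base class, and the six-degrees-per-update rule is only there to give the display some visible state.

```cpp
#include <cassert>
#include <string>

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update() = 0;
};

// Hypothetical graphical base class; this is the primary base of the display.
class View {
public:
    virtual ~View() = default;
    virtual std::string Draw() const = 0;
};

// Observer is mixed in as an auxiliary base next to the primary base View.
class AnalogTimeDisplay : public View, public Observer {
public:
    std::string Draw() const override {
        assert(_angle >= 0 && _angle < 360);
        return "hand at " + std::to_string(_angle) + " degrees";
    }
    // Advance the second hand by one step (6 degrees) per notification.
    void Update() override { _angle = (_angle + 6) % 360; }
private:
    int _angle = 0;
};
```

The same object can be handed to a subject as an Observer and to a window system as a View.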

In Figure 3, the Timer subclass inherits
the Subject interface without any overriding.
This is not always the case. For
example, it is possible that a Subject subclass
wants to customize how observers
are maintained. In Smalltalk-80, the Subject
base class (Object) implements the
subject interface in a space-efficient way.
Instead of storing the list of observers in 
each subject, the subject/observer map¬ 
ping is maintained in a central dictionary. 
Only subjects that actually have observers 
are stored in the dictionary and have to 
pay for the subject service. However, this 
approach trades space for time: Access¬ 
ing a subject's observers requires a dic¬ 
tionary look-up. For subjects that often 
notify observers, eliminate this inefficiency 
by storing the observers directly in an in¬ 
stance variable. The subject interface can 
then be implemented by accessing this 
list directly. In Smalltalk-80, this kind of
Subject implementation is provided by
the Object subclass Model. Consequently,
the client has the choice between subject
implementations with different tradeoffs
by inheriting from either Object or
Model. As an aside, the Subject interface
is an example of so-called “coupled overrides.”
If you override one of the subject
operations, you should also override the
others.
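
The central-dictionary variant can be approximated in C++ as follows. This is a loose analogue of the Smalltalk-80 scheme described above, not a port of it; the shared std::map, the Table function, and the SubjectsWithObservers helper are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update() = 0;
};

// Subjects carry no per-object observer list; all lists live in one table.
class Subject {
public:
    virtual ~Subject() { Table().erase(this); }
    void Attach(Observer* o) {
        assert(o != nullptr);
        Table()[this].push_back(o);  // an entry is created only on first use
    }
    void Notify() {
        auto it = Table().find(this);       // dictionary look-up on every notify
        if (it == Table().end()) return;    // unobserved subjects cost nothing
        std::vector<Observer*> snapshot = it->second;
        for (Observer* o : snapshot) o->Update();
    }
    static std::size_t SubjectsWithObservers() { return Table().size(); }
private:
    static std::map<const Subject*, std::vector<Observer*>>& Table() {
        static std::map<const Subject*, std::vector<Observer*>> table;
        return table;
    }
};

class Counter : public Observer {
public:
    void Update() override { ++count; }
    int count = 0;
};
```

A subject that notifies often would instead keep the vector as a data member, as in the Model variant, and skip the look-up.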

Push versus Pull Update Protocols 

In the timer example, the timer makes no
assumptions about what objects are observing
it. Instead it relies on the various
timer displays querying it to retrieve the
current time. The observers “pull” the state
of the subject to them. An alternative is
for the timer to send, or “push,” the time
to its observers whenever it updates them.
Pushing the time requires extending the
interface of Observers to accept the time
in seconds. To do this, you replace the
Observer class with a TimerObserver class;
see Listing Six. The TimerSubject class
would now have to maintain a list of
TimerObservers and its Notify function
would look like Listing Seven. The observers
are now more tightly coupled to
the timer, but they no longer need to
query the timer for the time. It is still possible
to have arbitrary Timer observers by
subclassing from TimerObserver. However,
TimerSubject and TimerObservers can
no longer be used to maintain general dependency
relationships.
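
A push-style version of the timer might look like the sketch below. It is not the article's Listings Six and Seven; the Attach name, the long parameter type, and the Shown accessor are assumptions.

```cpp
#include <cassert>
#include <vector>

// The update interface now carries the pushed state.
class TimerObserver {
public:
    virtual ~TimerObserver() = default;
    virtual void Update(long seconds) = 0;
};

class TimerSubject {
public:
    virtual ~TimerSubject() = default;
    void Attach(TimerObserver* o) {
        assert(o != nullptr);
        _observers.push_back(o);
    }
protected:
    // Push: every observer receives the new time with the notification.
    void Notify(long seconds) {
        for (TimerObserver* o : _observers) o->Update(seconds);
    }
private:
    std::vector<TimerObserver*> _observers;
};

class Timer : public TimerSubject {
public:
    void Tick() { ++_seconds; Notify(_seconds); }
private:
    long _seconds = 0;
};

class DigitalTimeDisplay : public TimerObserver {
public:
    // No reference back to the timer is needed any more.
    void Update(long seconds) override { _shown = seconds; }
    long Shown() const { return _shown; }
private:
    long _shown = 0;
};
```

The display no longer holds a pointer to the timer, but TimerSubject and TimerObserver are now useful only for subjects whose state is a time in seconds.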

The decision to use push or pull update
protocols depends on many tradeoffs:
the amount of data being pushed and
the expense of pushing it, the difficulty of
determining what changed in the subject,
the cost of notification and subsequent
updates (whether subjects and observers
are in the same address space), and dependencies
introduced by observers being
dependent on the pushed data.

The push model is more appropriate
when editing text. Consider an implementation
where a TextSubject stores the
textual data and a TextView acting as its
observer presents the text in a window.
When the user changes the text by entering
a character, the pull model requires
that the TextView completely reformat the
text and refresh the window or that it
somehow can determine which range of
characters really changed. Both of these
operations can be quite time consuming.
A more satisfactory approach is for the
TextSubject to provide a “hint” of its
changed text. The TextView uses this hint
to update itself more efficiently. Hints can
be simple, enumerated constants that provide
general indications of what changed
in the Subject, or more sophisticated, specific
information to aid the TextView.
TextView is interested in how the TextSubject
changed: whether characters were
added or removed, and where.

The hint can package information about
the actual changes (“deleted range 12-27”)
and push it to the observers. The hint essentially
sends the deltas that have occurred
in the subject. In practice, not all
observers will be interested in every hint;
they may ignore some and act as if they
had received a simple update request.

Figure 2: Class diagram for directly coupling the timer with its observer.

A hint can be extended with additional
information by making it a first-class
object. This enables subjects to bundle the
additional information by subclassing from
a Hint base class. At the receiving end,
the observer downcasts the hint to the desired
type and extracts the additional information.
This downcast should of course
not be done in a “hard” way, and it should
be guarded by using the C++ run-time
type identification facilities (dynamic_cast).
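
A guarded downcast of a first-class hint could be sketched like this. RangeDeletedHint, the Delete and SetText operations, and the two counters are hypothetical names chosen for the example; only Hint, TextSubject, and TextView come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class Hint {
public:
    virtual ~Hint() = default;  // polymorphic base, required for dynamic_cast
};

// Hypothetical hint subclass that bundles a deleted character range.
class RangeDeletedHint : public Hint {
public:
    RangeDeletedHint(std::size_t firstIndex, std::size_t length)
        : first(firstIndex), count(length) {}
    std::size_t first;
    std::size_t count;
};

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update(const Hint* hint) = 0;  // null means "no hint"
};

class TextSubject {
public:
    explicit TextSubject(std::string text) : _text(std::move(text)) {}
    void Attach(Observer* o) { assert(o != nullptr); _observers.push_back(o); }
    const std::string& Text() const { return _text; }
    void Delete(std::size_t first, std::size_t count) {
        _text.erase(first, count);
        RangeDeletedHint hint(first, count);
        Notify(&hint);              // push the delta
    }
    void SetText(std::string text) {
        _text = std::move(text);
        Notify(nullptr);            // plain update request
    }
private:
    void Notify(const Hint* hint) {
        for (Observer* o : _observers) o->Update(hint);
    }
    std::string _text;
    std::vector<Observer*> _observers;
};

class TextView : public Observer {
public:
    explicit TextView(const TextSubject* subject)
        : _subject(subject), _shown(subject->Text()) {}
    void Update(const Hint* hint) override {
        // Guarded downcast: unknown or missing hints fall back to a full pull.
        if (const auto* deleted = dynamic_cast<const RangeDeletedHint*>(hint)) {
            _shown.erase(deleted->first, deleted->count);
            ++incrementalUpdates;
        } else {
            _shown = _subject->Text();
            ++fullRefreshes;
        }
    }
    const std::string& Shown() const { return _shown; }
    int incrementalUpdates = 0;
    int fullRefreshes = 0;
private:
    const TextSubject* _subject;
    std::string _shown;
};
```

An observer that does not recognize a hint type simply takes the fallback branch, which matches the remark that some observers treat a hint as a simple update request.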

Who Sends Out Notifications

When notifications are sent, the subject 
must be in a consistent state. If it is not, 
strange results may occur in the observers 
as they try to update themselves from a 
nonsensical subject. 

Which object has the responsibility to
actually send the notification is also important.
In our example, it is the Timer
object that sends notifications in its Tick
operation. This works fine as long as the
subject is simple.

In Listing Eight, the Tick operation is 
overridden in a special kind of timer that 
allows you to set alarms. See the prob¬ 
lem? When subjects have the responsibil¬ 
ity to send notifications, overriding oper¬ 
ations in the subject may cause spurious 
and inconsistent notifications. By overriding
Tick, the first notification (sent from
Timer::Tick) is sent while the AlarmedTimer
is in an inconsistent state (the
_alarm variable should be set at that time
but isn't until the second notification).
Some observers could set off alarms by 
testing the result of AlarmSet and some 
could do so by testing the equality of 
AlarmTime and CurrentTime. 

There are simple fixes for this problem. 
But for more complex subjects with de¬ 
rived classes, overriding operations that 
send notifications in the subject could 
make the subject inconsistent or cause du¬ 
plicate notifications to be sent. 

One solution to the problem of who sends
notifications is to make the clients that change the
subject initiate the notifications. Whenever
the client makes a change, it must
call Notify on the subject. This solution is
practical, but places extra burden on the
clients. It is easy to forget to call Notify on
the subject. Another solution is to define
Tick as a template method (see the Template
Method pattern from our book) that
just calls the operation DoTick and then
Notify. The Timer class defines DoTick to
increment the current time. Subclasses can
override this operation to provide their
own extensions to the Tick operation.
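
The template-method fix can be sketched as follows. The article's Listing Eight is not reproduced here, so the shape of AlarmedTimer (a constructor taking the alarm time, plus AlarmSet and AlarmTime accessors) is an assumption, and AlarmWatcher is a hypothetical observer added only to check consistency.

```cpp
#include <cassert>
#include <vector>

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update() = 0;
};

class Subject {
public:
    virtual ~Subject() = default;
    void Attach(Observer* o) { assert(o != nullptr); _observers.push_back(o); }
protected:
    void Notify() { for (Observer* o : _observers) o->Update(); }
private:
    std::vector<Observer*> _observers;
};

class Timer : public Subject {
public:
    // Template method: non-virtual, so the notification is always sent once,
    // and only after every DoTick extension has finished.
    void Tick() {
        DoTick();
        Notify();
    }
    long CurrentTime() const { return _seconds; }
protected:
    virtual void DoTick() { ++_seconds; }
private:
    long _seconds = 0;
};

class AlarmedTimer : public Timer {
public:
    explicit AlarmedTimer(long alarmTime) : _alarmTime(alarmTime) {}
    bool AlarmSet() const { return _alarm; }
    long AlarmTime() const { return _alarmTime; }
protected:
    void DoTick() override {
        Timer::DoTick();
        _alarm = (CurrentTime() == _alarmTime);
    }
private:
    long _alarmTime;
    bool _alarm = false;
};

// Hypothetical observer that records whether the two alarm tests ever disagree.
class AlarmWatcher : public Observer {
public:
    explicit AlarmWatcher(const AlarmedTimer* t) : _timer(t) {}
    void Update() override {
        ++notifications;
        bool byFlag = _timer->AlarmSet();
        bool byTime = (_timer->CurrentTime() == _timer->AlarmTime());
        if (byFlag != byTime) sawInconsistency = true;
    }
    int notifications = 0;
    bool sawInconsistency = false;
private:
    const AlarmedTimer* _timer;
};
```

Because subclasses extend DoTick rather than Tick, each tick produces exactly one notification, and observers see the alarm flag and the current time in agreement.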


Subscribing to Specific Aspects of 
Observers 

When a subject has complex internal state,
observers may spend much effort to determine
exactly what changed in the subject.
Along with using hints, you can reduce
this burden by relying on intrinsic
properties of the subject itself. Complex
subjects may only change their state in
predefined ways. Changes in part of a subject's
state may be independent of changes
in other parts.

Such properties can be exploited by 
having the subject define independent as¬ 
pects and having observers only subscribe 
to the aspects they are interested in. 

In the Timer example, suppose that the 
Timer class were implemented with three 
distinct counters that maintained the time 
in seconds, minutes, and hours. Now the 
various timer displays will usually be de¬ 
fined in terms of hours, minutes, and sec¬ 
onds. Clearly, not all of these need to be 
updated each second. In fact, the hour, 
minute, and second counters change al¬ 
most independently of each other. You 
can exploit this by defining aspects that 
represent changes in hours, minutes, and 
seconds and defining our displays as con¬ 
sisting of three independent parts, each 
subscribing to a particular part of the 
Timer. 

In this example, assume that the aspects 
are simply defined as integer constants 
that are passed as a parameter to Notify; 
see Listing Nine. The class Timer makes 
its aspects available to the clients as class- 
scoped constants. In Listing Ten, for example,
the changed aspect is passed as a
hint to the Observer's Update operation in
Listing Eleven. If there are many different
aspects, the update operation becomes a
lengthy conditional statement that maps
an aspect to a piece of code. Such conditional
code is not very elegant. There
are different techniques to avoid this kind
of manual-dispatching code. One technique
is demonstrated in VisualWorks
Smalltalk, wherein the dispatching problem
is solved with a DependencyTransformer
object that implements the Observer
interface. It knows which aspect it
is interested in and keeps track of the actual
receiver of the notification and the
operation to be executed by the receiver
when the aspect changes. Figure 5 shows
a possible class structure for a DependencyTransformer.
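
The manual dispatch that such a transformer replaces can be sketched as follows. This is not the article's Listings Nine through Eleven; the aspect names, the unnamed enum used for the class-scoped constants, and the redraw counters are assumptions.

```cpp
#include <cassert>
#include <vector>

class Observer {
public:
    virtual ~Observer() = default;
    virtual void Update(int aspect) = 0;  // the aspect is passed as a hint
};

class Subject {
public:
    virtual ~Subject() = default;
    void Attach(Observer* o) { assert(o != nullptr); _observers.push_back(o); }
protected:
    void Notify(int aspect) { for (Observer* o : _observers) o->Update(aspect); }
private:
    std::vector<Observer*> _observers;
};

class Timer : public Subject {
public:
    // Aspects offered to clients as class-scoped integer constants.
    enum { SecondsAspect, MinutesAspect, HoursAspect };

    void Tick() {
        bool minuteChanged = false;
        bool hourChanged = false;
        _seconds = (_seconds + 1) % 60;
        if (_seconds == 0) {
            minuteChanged = true;
            _minutes = (_minutes + 1) % 60;
            if (_minutes == 0) {
                hourChanged = true;
                _hours = (_hours + 1) % 24;
            }
        }
        // Notify only after all three counters are consistent again.
        Notify(SecondsAspect);
        if (minuteChanged) Notify(MinutesAspect);
        if (hourChanged) Notify(HoursAspect);
    }
private:
    int _seconds = 0, _minutes = 0, _hours = 0;
};

class DigitalTimeDisplay : public Observer {
public:
    // Manual dispatch: one branch per aspect, growing with every new aspect.
    void Update(int aspect) override {
        switch (aspect) {
        case Timer::SecondsAspect: ++secondRedraws; break;
        case Timer::MinutesAspect: ++minuteRedraws; break;
        case Timer::HoursAspect:   ++hourRedraws;   break;
        default: break;
        }
    }
    int secondRedraws = 0, minuteRedraws = 0, hourRedraws = 0;
};
```

After sixty ticks the seconds part has been redrawn sixty times and the minutes part once, which is the saving the aspect mechanism is after.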

When a DependencyTransformer receives
the update, it checks the aspect. If
the aspect matches, DependencyTransformer
invokes the operation on the Receiver.
DependencyTransformers are created
by the subject when an observer
expresses its interest in a changed aspect.
This requires a way to specify the operation
to be called. In Smalltalk, the operation's
selector name is specified; #updateSeconds,
for example.

DependencyTransformers act as an intermediary between the subject and
its dependent object. They map the Observer interface to an operation
of the dependent object. A DependencyTransformer is therefore an
example of the Adapter pattern.
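
The same idea can be approximated in C++, where a pointer to a member
function plays the role of the Smalltalk selector. The sketch below is
one possible rendering, not the VisualWorks implementation: the
template parameter Receiver, the operation name ExpressInterestIn, and
the members of this AnalogTimeDisplay are all assumptions made for the
illustration.

```cpp
#include <cassert>
#include <memory>
#include <vector>

enum Aspect { ASPECT_SECONDS, ASPECT_MINUTES, ASPECT_HOURS };

class Observer {
public:
    virtual ~Observer() {}
    virtual void Update(int aspect) = 0;
};

// Adapter: implements Observer and maps a matching aspect to one
// member function of the real receiver.
template <class Receiver>
class DependencyTransformer : public Observer {
public:
    typedef void (Receiver::*Operation)();

    DependencyTransformer(int aspect, Receiver* receiver, Operation operation)
        : _aspect(aspect), _receiver(receiver), _operation(operation) {
        assert(receiver != nullptr && operation != nullptr);
    }

    void Update(int aspect) override {
        if (aspect == _aspect) {
            (_receiver->*_operation)();   // dispatch without a conditional chain
        }
    }

private:
    int _aspect;
    Receiver* _receiver;
    Operation _operation;
};

class Subject {
public:
    // The subject creates (and here also owns) one transformer per interest.
    template <class Receiver>
    void ExpressInterestIn(int aspect, Receiver* receiver,
                           void (Receiver::*operation)()) {
        _observers.push_back(std::unique_ptr<Observer>(
            new DependencyTransformer<Receiver>(aspect, receiver, operation)));
    }

    void Notify(int aspect) {
        for (const auto& observer : _observers) {
            observer->Update(aspect);
        }
    }

private:
    std::vector<std::unique_ptr<Observer>> _observers;
};

// Hypothetical receiver; note that it does not derive from Observer.
class AnalogTimeDisplay {
public:
    void UpdateSeconds() { ++secondHandMoves; }
    void UpdateMinutes() { ++minuteHandMoves; }
    int secondHandMoves = 0;
    int minuteHandMoves = 0;
};
```

A client would register with a call such as
timer.ExpressInterestIn(ASPECT_SECONDS, &display,
&AnalogTimeDisplay::UpdateSeconds). The display then needs no Update
operation and no conditional code; the price is one small transformer
object per registered interest.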

Making an Arbitrary Class a Subject

Sometimes classes are not designed to be subjects, but later, you
realize that instances of these classes might have dependent objects.
How do you make such classes into subjects? You could change the class
by mixing in the Subject interface, but this is not always possible.
The class you wish to make a subject may not be modifiable; it may
reside in a class library over which you have no control.

An elegant way to allow arbitrary classes to become subjects is to
wrap the object in another object that adds the Subject behaviors and
interfaces. This decorator object (this is an example of the Decorator
pattern) intercepts and forwards all requests to the wrapped object,
and notifies clients after operations which are likely to change the
wrapped object.

Figure 3: Class relationships in OMT notation.

Figure 5: Possible class structure for a DependencyTransformer.

Suppose the class Timer was in fact not designed to be a subject and
was defined as in Listing Twelve. You could make Timer a subject by
defining a TimerDecorator class as in Listing Thirteen. The
TimerDecorator has the same interface as the Timer so it looks like a
timer to clients. Every request made of the TimerDecorator is
forwarded on to the timer, and then the decorator calls Notify on
itself to update its observers; see Listing Fourteen.

Making objects into subjects by using decorators is only practical
when the decorated object's interface is relatively small, because you
have to duplicate the subject's interface in the decorator. If the
subject's interface is large, this approach can become unwieldy.
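
A complete, compilable variant of the decorator might look like the
sketch below. It rests on assumptions that go beyond the listings: the
Subject stores its observers in a std::vector, CurrentTime is declared
virtual so that the decorator can forward it too (Listing Twelve does
not show this), and RecordingDisplay is a hypothetical observer used
only to make the effect visible.

```cpp
#include <cassert>
#include <vector>

class Observer {
public:
    virtual ~Observer() {}
    virtual void Update() = 0;
};

class Subject {
public:
    void Attach(Observer* o) { _observers.push_back(o); }
    void Notify() {
        for (Observer* o : _observers) o->Update();
    }
private:
    std::vector<Observer*> _observers;
};

// The original class: it knows nothing about observers.
class Timer {
public:
    virtual ~Timer() {}
    virtual void Tick() { _currentTime++; }
    // Assumed virtual here so the decorator can forward it.
    virtual long CurrentTime() const { return _currentTime; }
private:
    long _currentTime = 0;
};

// Looks like a Timer to clients, but also behaves as a Subject.
class TimerDecorator : public Timer, public Subject {
public:
    explicit TimerDecorator(Timer* timer) : _timer(timer) {
        assert(timer != nullptr);
    }
    void Tick() override {
        _timer->Tick();   // forward the request to the wrapped timer
        Notify();         // then tell the observers about the change
    }
    long CurrentTime() const override { return _timer->CurrentTime(); }
private:
    Timer* _timer;
};

// Hypothetical observer that records what it saw on each update.
class RecordingDisplay : public Observer {
public:
    explicit RecordingDisplay(Timer* timer) : _timer(timer) {}
    void Update() override {
        lastSeen = _timer->CurrentTime();
        ++updates;
    }
    long lastSeen = 0;
    int updates = 0;
private:
    Timer* _timer;
};
```

Every operation that clients may call has to be forwarded by hand, as
Tick and CurrentTime are here, and this duplication is what makes the
approach unwieldy for large interfaces.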

Conclusion 

In one form or another, the Observer pattern occurs in many
object-oriented systems. While most commonly used for decoupling user
interfaces from the data they display, it is also often used to manage
dependencies between objects. The Observer pattern has many more
possible variations than the few we've examined. For example, we did
not look at batching notifications, concurrency and distribution, or
observing more than one subject.

Finally, a description of the Observer pattern would not be complete
without mentioning its origin in Smalltalk's Model-View-Controller
(MVC) framework. In this design, the Model encapsulates application
data. The View presents the model to the user. The Controller is
responsible for handling user input. From a dependency-management
view, MVC provides the idea of decoupling the application data from
the user interface. The benefit of this decoupling is that the
application data can be presented by different user interfaces. In MVC
terminology, the timer object becomes the “model” and the time display
becomes a “view.” If we supported manipulation of the time display by
the user, then this behavior would be assigned to the Controller.


Listing One (Text begins on page 59)

class DigitalTimeDisplay;
class Timer {
public:
    Timer(DigitalTimeDisplay*);
    long CurrentTime() const;
    void Tick();
private:
    DigitalTimeDisplay* _display;
    long _currentTime;
};

class DigitalTimeDisplay {
public:
    DigitalTimeDisplay();
    void DisplayTime();
    void UpdateTime(long time);
};

void Timer::Tick()
{
    _currentTime++;
    _display->UpdateTime(_currentTime);
}

Listing Two

class Subject {
public:
    void Attach(Observer*);
    void Detach(Observer*);
    void Notify();
protected:
    Subject();
private:
    List<Observer*> *_observers;
};

void Subject::Notify()
{
    ListIterator<Observer*> i(_observers);
    for (i.First(); !i.IsDone(); i.Next()) {
        i.CurrentItem()->Update();
    }
}

Listing Three

class Timer : public Subject {
public:
    Timer();
    virtual void Tick();
    long CurrentTime() const;
private:
    long _currentTime;
};

void Timer::Tick()
{
    _currentTime++;
    Notify();
}

Listing Four

class Observer {
public:
    virtual void Update() = 0;
protected:
    Observer();
};

class DigitalTimeDisplay : public Observer {
public:
    DigitalTimeDisplay(Timer*);
    virtual void Update();
    void DisplayTime(long time);
private:
    Timer* _timer;
};

DigitalTimeDisplay::DigitalTimeDisplay(Timer* t) : _timer(t)
{
}

void DigitalTimeDisplay::Update()
{
    DisplayTime( _timer->CurrentTime() );
}

Listing Five

class AnalogTimeDisplay : public Observer {
public:
    AnalogTimeDisplay(Timer*);
    virtual void Update();
    void DisplayTime(long time);
private:
    Timer* _timer;
};

AnalogTimeDisplay::AnalogTimeDisplay(Timer* t) : _timer(t)
{
}

void AnalogTimeDisplay::Update()
{
    DisplayTime( _timer->CurrentTime() );
}

Listing Six

class TimerObserver {
public:
    virtual void Update(long) = 0;
protected:
    TimerObserver();
};

Listing Seven

void TimerSubject::Notify(long time)
{
    ListIterator<TimerObserver*> i(_observers);
    for (i.First(); !i.IsDone(); i.Next()) {
        i.CurrentItem()->Update(time);
    }
}

Listing Eight

class AlarmedTimer : public Timer {
public:
    AlarmedTimer();
    virtual void Tick();
    long AlarmTime();
    bool AlarmSet();
private:
    long _alarmTime;
    bool _alarm;
};

AlarmedTimer::AlarmedTimer()
    : _alarmTime(0), _alarm(false)
{
}

void AlarmedTimer::Tick()
{
    Timer::Tick();
    if ( CurrentTime() == _alarmTime ) {
        _alarm = true;
    } else {
        _alarm = false;
    }
}




Dr. Dobb’s Sourcebook, September/October 1995



Listing Nine

class Subject {
public:
    // ...
    void Notify(int aspect);
};


Listing Ten

class Timer : public Subject {
public:
    //...
    static const int ASPECT_SECONDS;
    static const int ASPECT_MINUTES;
    static const int ASPECT_HOURS;
    //...
    int Seconds() const;
    int Minutes() const;
    int Hours() const;
private:
    int _seconds;
    int _minutes;
    int _hours;
};

void Timer::Tick()
{
    // (x + 1) % n avoids modifying a counter twice in one expression
    _seconds = (_seconds + 1) % 60;
    Notify(ASPECT_SECONDS);
    if ( _seconds == 0 ) {
        _minutes = (_minutes + 1) % 60;
        Notify(ASPECT_MINUTES);
    }
    if ( _seconds == 0 && _minutes == 0 ) {
        _hours = (_hours + 1) % 24;
        Notify(ASPECT_HOURS);
    }
}

Listing Eleven

class AnalogTimeDisplay : public Observer {
public:
    AnalogTimeDisplay(Timer*);
    virtual void Update(int aspect);
    void DisplayTime(long time);
private:
    Timer* _timer;
};

void AnalogTimeDisplay::Update(int aspect)
{
    if (aspect == Timer::ASPECT_SECONDS) {
        // update second hand ...
    } else if (aspect == Timer::ASPECT_MINUTES) {
        // update minute hand ...
    } else if (aspect == Timer::ASPECT_HOURS) {
        // update hour hand ...
    } else {
        // full update
    }
}

Listing Twelve

class Timer {
public:
    virtual void Tick();
    long CurrentTime() const;
private:
    long _currentTime;
};

void Timer::Tick()
{
    _currentTime++;
}

Listing Thirteen

class TimerDecorator : public Timer, public Subject {
public:
    TimerDecorator(Timer*);
    virtual void Tick();
private:
    Timer* _timer;
};

Listing Fourteen

void TimerDecorator::Tick()
{
    _timer->Tick();   // forward the request to the wrapped timer
    Notify();         // then notify the decorator's observers
}

End Listings